Load Required Packages

First, we load all of the packages required in this project at once, so that all project partners load and work with the same set of packages.
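The package-loading chunk itself is not shown here. As a minimal sketch, loading several packages in one call could look like the following. The helper name `load_packages` is hypothetical, and `knitr` is the only package whose use (`kable`) is visible in this report, so the final call is an assumption and is left commented out.

```r
# Hypothetical helper: attach every package named in `pkgs`, stopping early
# with a readable message if one of them is not installed.
load_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  not_installed <- pkgs[!installed]
  if (length(not_installed) > 0) {
    stop("Please install: ", paste(not_installed, collapse = ", "))
  }
  for (p in pkgs) {
    library(p, character.only = TRUE)
  }
  invisible(pkgs)
}

# Assumption: knitr is the only package this report visibly relies on (kable).
# load_packages(c("knitr"))
```

Note that loading packages together does not by itself pin their versions; partners would still need to agree on the installed versions separately.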

1. Dataset and Introduction

We chose the ECC dataset because it concerns healthcare, the field in which we have the strongest professional interest.

Objective: Early childhood caries (ECC) is a potentially severe disease affecting children all over the world [1]. The available findings are mostly based on logistic regression models, but data mining could be used to extract more information from the same data set. In the paper, the authors apply association rule mining for the sake of interpretability. While model interpretability is important, we look for other classification and clustering methods that achieve better performance.

Next, we import the training, validation and test splits of the ECC dataset.

2. Descriptive Statistics

#READ DATA
TRAIN = read.csv("./ECC_train.csv")
VALIDATION = read.csv("./ECC_validation.csv")
TEST = read.csv("./ECC_test.csv")
options(knitr.kable.NA = '')
#summary of the dataset gives us the brief information.
kable(summary(TRAIN)) 
Summary of TRAIN, one row per attribute:

CITY (counts): NOVI_SAD 79, BACKA_PALANKA 42, KISAC 29, RUSKI_KRSTUR 23, TITEL 22, TEMERIN 17, (Other) 27

Attribute                               Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
CHILD_ETHNICITY                         1.000   1.000    1.000   2.167   3.000    7.000
CHILD_AGE                               1.00    3.00     3.00    3.13    4.00     5.00
CHILD_GENDER                            1.000   1.000    1.000   1.473   2.000    2.000
CHILD_SERBIAN_LANGUAGE                  1.000   1.000    1.000   1.134   1.000    2.000
MOTHER_AGE                              1.000   2.000    3.000   2.427   3.000    3.000
MARITAL_STATUS                          1.000   1.000    1.000   1.238   1.000    3.000
MOTHER_ETHNICITY                        1.00    1.00     1.00    22.91   3.00     999.00
MOTHER_SERBIAN_LANGUAGE                 1.000   1.000    2.000   1.732   2.000    2.000
NUMBER_OF_CHILDREN                      1.0     1.0      2.0     1.9     2.0      3.0
BIRTH_ORDER                             1.000   1.000    1.000   1.678   2.000    3.000
MOTHER_EDUCATION_LEVEL                  1.000   3.000    3.000   3.008   3.000    4.000
MOTHER_EMPLOYMENT_STATUS                1.000   2.000    3.000   2.427   3.000    4.000
QUALITY_OF_HOUSING                      1.000   1.000    2.000   1.849   3.000    3.000
HOUSING_CONDITIONS                      1.0     1.0      1.0     1.1     1.0      2.0
HOUSEHOLD_MONTHLY_INCOME                1.000   3.000    4.000   3.347   4.000    5.000
BIRTH_WEIGHT                            1.000   2.000    2.000   1.908   2.000    2.000
BREASTFEEDING                           1       1        2       2       3        4
BREASTFEEDING_FREQUENCY                 1       2        2       119     3        999
BREASTFEEDING_DURING_NIGHT              1.00    1.00     1.00    93.01   1.00     999.00
BOTTLE_FEEDING                          1.000   2.000    2.000   2.427   3.000    4.000
INFANT_FORMULAS                         1.000   1.000    2.000   1.565   2.000    2.000
ADDITIONAL_FOOD_SWEETENING              1.000   2.000    2.000   2.297   3.000    3.000
CHILD_FLUORIDE_SUPPLEMENTS              1.000   2.000    3.000   2.707   3.000    3.000
CHILD_FLUORIDE_TOOTHPASTE               1.000   1.000    1.000   1.397   2.000    3.000
CHILD_ORAL_HYGIENE                      1.000   2.000    2.000   1.879   2.000    3.000
CHILD_TOOTH_BRUSHING                    1.000   2.000    2.000   2.146   3.000    4.000
DIARRHEA_DURING_INFANCY                 1.0     2.0      2.0     1.9     2.0      2.0
MEDICAL_SYRUPS                          1.000   2.000    2.000   2.423   3.000    3.000
CHILD_FIRST_DENTIST_VISIT               1.000   2.000    3.000   3.159   4.000    4.000
SWEETS_DURING_PREGNANCY                 1.000   2.000    2.000   1.854   2.000    3.000
FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY   1.000   2.000    3.000   2.431   3.000    3.000
ORAL_HEALTH_DURING_PREGNANCY            1.000   1.000    2.000   1.799   2.000    3.000
MOTHER_HEALTH_AWARENESS                 1.000   2.000    2.000   2.059   2.000    3.000
FATHER_HEALTH_AWARENESS                 1.000   2.000    2.000   1.874   2.000    3.000
ECC                                     1.000   1.000    2.000   1.703   2.000    2.000
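
The maxima of 999 reported above for MOTHER_ETHNICITY, BREASTFEEDING_FREQUENCY and BREASTFEEDING_DURING_NIGHT stand out against the otherwise small integer codes. If 999 is a placeholder for a missing answer (an assumption on our side; the dataset documentation should be checked), it could be recoded to `NA` before any statistics are computed. The sketch below uses a small made-up data frame rather than the ECC files, and `recode_sentinel` is a hypothetical helper name.

```r
# Hypothetical helper: replace a sentinel code (assumed here to be 999) with NA
# in every numeric column of a data frame. Non-numeric columns are left alone.
recode_sentinel <- function(df, sentinel = 999) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(x) {
    x[!is.na(x) & x == sentinel] <- NA
    x
  })
  df
}

# Toy data for illustration only (not taken from the ECC files).
toy <- data.frame(CITY = c("A", "B", "C"),
                  FREQ = c(2, 999, 3),
                  AGE  = c(1, 2, 3))
clean <- recode_sentinel(toy)
mean(toy$FREQ)                  # inflated by the sentinel value
mean(clean$FREQ, na.rm = TRUE)  # computed on the valid answers only
```

After such a recoding, functions such as `IQR()` and `var()` need `na.rm = TRUE`, otherwise they stop with an error or return `NA` for the affected columns.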
for (col in 2:ncol(TRAIN)) {
  hist(TRAIN[,col], main = paste("Histogram of", colnames(TRAIN)[col]))
}

for (col in 2:ncol(TRAIN)) {
  qqnorm(TRAIN[,col], main = paste("Normal QQ Plot of ",colnames(TRAIN)[col])); qqline(TRAIN[,col])
}

Descriptive Location Measures for Each of the Numerical Attributes

Geometric Mean:

  • We already obtained the mean and median of each attribute with the summary() command.
  • In addition, the geometric mean is another important measure of central tendency.
geomean = matrix(0,36,1)
for (col in 2:ncol(TRAIN)) {
  geomean[col] = exp(mean(log(TRAIN[,col])))    
}
#geomean
geomean_vector <- data.frame(geomean)
row.names(geomean_vector) <- colnames(TRAIN)
kable(geomean_vector,row.names = TRUE)
geomean
CITY 0.000000
CHILD_ETHNICITY 1.758148
CHILD_AGE 3.018245
CHILD_GENDER 1.387803
CHILD_SERBIAN_LANGUAGE 1.097249
MOTHER_AGE 2.285236
MARITAL_STATUS 1.158650
MOTHER_ETHNICITY 1.887356
MOTHER_SERBIAN_LANGUAGE 1.661191
NUMBER_OF_CHILDREN 1.741823
BIRTH_ORDER 1.513556
MOTHER_EDUCATION_LEVEL 2.908949
MOTHER_EMPLOYMENT_STATUS 2.217976
QUALITY_OF_HOUSING 1.632377
HOUSING_CONDITIONS 1.072084
HOUSEHOLD_MONTHLY_INCOME 3.108337
BIRTH_WEIGHT 1.876377
BREASTFEEDING 1.754348
BREASTFEEDING_FREQUENCY 4.413374
BREASTFEEDING_DURING_NIGHT 2.084179
BOTTLE_FEEDING 2.240267
INFANT_FORMULAS 1.479237
ADDITIONAL_FOOD_SWEETENING 2.165548
CHILD_FLUORIDE_SUPPLEMENTS 2.638544
CHILD_FLUORIDE_TOOTHPASTE 1.272027
CHILD_ORAL_HYGIENE 1.799259
CHILD_TOOTH_BRUSHING 1.968312
DIARRHEA_DURING_INFANCY 1.865525
MEDICAL_SYRUPS 2.368098
CHILD_FIRST_DENTIST_VISIT 2.985174
SWEETS_DURING_PREGNANCY 1.783182
FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY 2.259006
ORAL_HEALTH_DURING_PREGNANCY 1.650330
MOTHER_HEALTH_AWARENESS 1.992148
FATHER_HEALTH_AWARENESS 1.783284
ECC 1.627806
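
The geometric mean of n strictly positive values is the n-th root of their product, which is what `exp(mean(log(x)))` computes. The same calculation can be sketched without an explicit loop by using `sapply`. The example runs on toy data rather than the ECC files, and `geo_mean` is a hypothetical helper name.

```r
# Hypothetical helper: geometric mean of a numeric vector.
# Defined only for strictly positive values, because log() is used.
geo_mean <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  if (any(x <= 0)) stop("geometric mean needs strictly positive values")
  exp(mean(log(x)))
}

# Toy data for illustration only.
toy <- data.frame(A = c(1, 4), B = c(2, 8), C = c(3, 3))
sapply(toy, geo_mean)
```

In the table above, the value 0 for CITY is not a computed geometric mean: the loop starts at column 2, so the first entry keeps the zero it was initialised with.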

Beyond central tendency, we also need to know how closely the data cluster around the center, that is, the spread of each attribute.

Range:

rangeVector = matrix(0,36,1)
for (col in 2:ncol(TRAIN)) {
  rangeVector[col] = max(TRAIN[,col], na.rm = TRUE)-min(TRAIN[,col], na.rm = TRUE)  
}

range_Vector <- data.frame(rangeVector)
row.names(range_Vector) <- colnames(TRAIN)
kable(range_Vector,row.names = TRUE)
rangeVector
CITY 0
CHILD_ETHNICITY 6
CHILD_AGE 4
CHILD_GENDER 1
CHILD_SERBIAN_LANGUAGE 1
MOTHER_AGE 2
MARITAL_STATUS 2
MOTHER_ETHNICITY 998
MOTHER_SERBIAN_LANGUAGE 1
NUMBER_OF_CHILDREN 2
BIRTH_ORDER 2
MOTHER_EDUCATION_LEVEL 3
MOTHER_EMPLOYMENT_STATUS 3
QUALITY_OF_HOUSING 2
HOUSING_CONDITIONS 1
HOUSEHOLD_MONTHLY_INCOME 4
BIRTH_WEIGHT 1
BREASTFEEDING 3
BREASTFEEDING_FREQUENCY 998
BREASTFEEDING_DURING_NIGHT 998
BOTTLE_FEEDING 3
INFANT_FORMULAS 1
ADDITIONAL_FOOD_SWEETENING 2
CHILD_FLUORIDE_SUPPLEMENTS 2
CHILD_FLUORIDE_TOOTHPASTE 2
CHILD_ORAL_HYGIENE 2
CHILD_TOOTH_BRUSHING 3
DIARRHEA_DURING_INFANCY 1
MEDICAL_SYRUPS 2
CHILD_FIRST_DENTIST_VISIT 3
SWEETS_DURING_PREGNANCY 2
FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY 2
ORAL_HEALTH_DURING_PREGNANCY 2
MOTHER_HEALTH_AWARENESS 2
FATHER_HEALTH_AWARENESS 2
ECC 1

Interquartile Range

iqc = matrix(0,36,1)
for (col in 2:ncol(TRAIN)) {
  iqc[col] = IQR(TRAIN[,col])   
}

iqr_vector <- data.frame(iqc)
row.names(iqr_vector) <- colnames(TRAIN)
kable(iqr_vector, row.names = TRUE)
iqc
CITY 0
CHILD_ETHNICITY 2
CHILD_AGE 1
CHILD_GENDER 1
CHILD_SERBIAN_LANGUAGE 0
MOTHER_AGE 1
MARITAL_STATUS 0
MOTHER_ETHNICITY 2
MOTHER_SERBIAN_LANGUAGE 1
NUMBER_OF_CHILDREN 1
BIRTH_ORDER 1
MOTHER_EDUCATION_LEVEL 0
MOTHER_EMPLOYMENT_STATUS 1
QUALITY_OF_HOUSING 2
HOUSING_CONDITIONS 0
HOUSEHOLD_MONTHLY_INCOME 1
BIRTH_WEIGHT 0
BREASTFEEDING 2
BREASTFEEDING_FREQUENCY 1
BREASTFEEDING_DURING_NIGHT 0
BOTTLE_FEEDING 1
INFANT_FORMULAS 1
ADDITIONAL_FOOD_SWEETENING 1
CHILD_FLUORIDE_SUPPLEMENTS 1
CHILD_FLUORIDE_TOOTHPASTE 1
CHILD_ORAL_HYGIENE 0
CHILD_TOOTH_BRUSHING 1
DIARRHEA_DURING_INFANCY 0
MEDICAL_SYRUPS 1
CHILD_FIRST_DENTIST_VISIT 2
SWEETS_DURING_PREGNANCY 0
FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY 1
ORAL_HEALTH_DURING_PREGNANCY 1
MOTHER_HEALTH_AWARENESS 0
FATHER_HEALTH_AWARENESS 0
ECC 1

Variance

variance = matrix(0,36,1)

for (col in 2:ncol(TRAIN)) {
  variance[col] = var(TRAIN[,col])      
}
var_vector <- data.frame(variance)
row.names(var_vector) <- colnames(TRAIN)
kable(var_vector, row.names = TRUE)
variance
CITY 0.000000e+00
CHILD_ETHNICITY 2.198762e+00
CHILD_AGE 6.091558e-01
CHILD_GENDER 2.503077e-01
CHILD_SERBIAN_LANGUAGE 1.164516e-01
MOTHER_AGE 5.229774e-01
MARITAL_STATUS 3.084280e-01
MOTHER_ETHNICITY 2.044553e+04
MOTHER_SERBIAN_LANGUAGE 1.968988e-01
NUMBER_OF_CHILDREN 5.697057e-01
BIRTH_ORDER 6.058507e-01
MOTHER_EDUCATION_LEVEL 4.537112e-01
MOTHER_EMPLOYMENT_STATUS 7.414648e-01
QUALITY_OF_HOUSING 8.175521e-01
HOUSING_CONDITIONS 9.071410e-02
HOUSEHOLD_MONTHLY_INCOME 1.194016e+00
BIRTH_WEIGHT 8.392810e-02
BREASTFEEDING 1.058823e+00
BREASTFEEDING_FREQUENCY 1.032018e+05
BREASTFEEDING_DURING_NIGHT 8.356663e+04
BOTTLE_FEEDING 8.254984e-01
INFANT_FORMULAS 2.468268e-01
ADDITIONAL_FOOD_SWEETENING 4.954116e-01
CHILD_FLUORIDE_SUPPLEMENTS 2.752013e-01
CHILD_FLUORIDE_TOOTHPASTE 4.841954e-01
CHILD_ORAL_HYGIENE 2.583243e-01
CHILD_TOOTH_BRUSHING 7.557751e-01
DIARRHEA_DURING_INFANCY 9.071410e-02
MEDICAL_SYRUPS 2.618403e-01
CHILD_FIRST_DENTIST_VISIT 8.653704e-01
SWEETS_DURING_PREGNANCY 2.179600e-01
FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY 6.160121e-01
ORAL_HEALTH_DURING_PREGNANCY 5.309237e-01
MOTHER_HEALTH_AWARENESS 2.486551e-01
FATHER_HEALTH_AWARENESS 3.035055e-01
ECC 2.096973e-01

Coefficient of Variation

CV = matrix(0,36,1)
for (col in 2:ncol(TRAIN)) {
  CV[col] = sd(TRAIN[,col], na.rm=TRUE)/mean(TRAIN[,col], na.rm=TRUE)*100       
}
CV_vector <- data.frame(CV)
row.names(CV_vector) <- colnames(TRAIN)
kable(CV_vector, row.names = TRUE)
CV
CITY 0.00000
CHILD_ETHNICITY 68.41594
CHILD_AGE 24.93794
CHILD_GENDER 33.96975
CHILD_SERBIAN_LANGUAGE 30.09548
MOTHER_AGE 29.79966
MARITAL_STATUS 44.84180
MOTHER_ETHNICITY 624.07043
MOTHER_SERBIAN_LANGUAGE 25.61646
NUMBER_OF_CHILDREN 39.73446
BIRTH_ORDER 46.39128
MOTHER_EDUCATION_LEVEL 22.39024
MOTHER_EMPLOYMENT_STATUS 35.48258
QUALITY_OF_HOUSING 48.89150
HOUSING_CONDITIONS 27.37030
HOUSEHOLD_MONTHLY_INCOME 32.64472
BIRTH_WEIGHT 15.18402
BREASTFEEDING 51.44958
BREASTFEEDING_FREQUENCY 270.01530
BREASTFEEDING_DURING_NIGHT 310.80960
BOTTLE_FEEDING 37.43933
INFANT_FORMULAS 31.74844
ADDITIONAL_FOOD_SWEETENING 30.64140
CHILD_FLUORIDE_SUPPLEMENTS 19.37844
CHILD_FLUORIDE_TOOTHPASTE 49.79225
CHILD_ORAL_HYGIENE 27.05417
CHILD_TOOTH_BRUSHING 40.50203
DIARRHEA_DURING_INFANCY 15.85548
MEDICAL_SYRUPS 21.12212
CHILD_FIRST_DENTIST_VISIT 29.44774
SWEETS_DURING_PREGNANCY 25.18735
FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY 32.28616
ORAL_HEALTH_DURING_PREGNANCY 40.49911
MOTHER_HEALTH_AWARENESS 24.22320
FATHER_HEALTH_AWARENESS 29.39024
ECC 26.89056
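  • The coefficient of variation (standard deviation divided by the mean, expressed here as a percentage) is better suited than the variance for comparing spread across attributes, because it is unit-free and therefore comparable between attributes measured on different scales.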
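
The four spread measures computed above (range, interquartile range, variance and sd / mean × 100) can also be collected in a single pass. The following is only a sketch on toy data, not the ECC files, and `spread_table` is a hypothetical helper name.

```r
# Hypothetical helper: range, IQR, variance and coefficient of variation (%)
# for every numeric column of a data frame, one row per column.
spread_table <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    range    = sapply(num, function(x) diff(range(x, na.rm = TRUE))),
    iqr      = sapply(num, IQR, na.rm = TRUE),
    variance = sapply(num, var, na.rm = TRUE),
    cv       = sapply(num, function(x) {
      sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100
    })
  )
}

# Toy data for illustration only.
toy <- data.frame(CITY = c("A", "B", "C", "D"),
                  X = c(1, 2, 3, 4),
                  Y = c(10, 10, 20, 20))
res <- spread_table(toy)
res
```

Selecting the numeric columns by type avoids hard-coding the column count and skips CITY without leaving a placeholder row of zeros.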

Correlation & Covariance

  • The correlation and covariance matrices will be very helpful in our feature selection process. It is unwise to include two highly correlated attributes in the same model, because they carry largely redundant information; this multicollinearity makes parameter estimates unstable and can contribute to overfitting.
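
One way to act on this is to list the attribute pairs whose absolute correlation exceeds a chosen cutoff. In the correlation matrix of this report, for example, NUMBER_OF_CHILDREN and BIRTH_ORDER correlate at about 0.81. The sketch below runs on toy data; `high_cor_pairs` is a hypothetical helper name and the cutoff of 0.8 is an illustrative choice, not a rule taken from the source.

```r
# Hypothetical helper: list attribute pairs whose absolute correlation
# exceeds `cutoff`. Only the upper triangle is scanned, so each pair
# appears once and the diagonal is ignored.
high_cor_pairs <- function(df, cutoff = 0.8, method = "pearson") {
  r <- cor(df, use = "pairwise.complete.obs", method = method)
  idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
  data.frame(var1 = rownames(r)[idx[, "row"]],
             var2 = colnames(r)[idx[, "col"]],
             r    = r[idx],
             stringsAsFactors = FALSE)
}

# Toy data for illustration only: B is almost a copy of A, C is unrelated.
toy <- data.frame(A = c(1, 2, 3, 4, 5),
                  B = c(1, 2, 3, 5, 4),
                  C = c(2, 1, 2, 1, 2))
pairs_found <- high_cor_pairs(toy)
pairs_found
```

Because most attributes here are ordinal codes, passing `method = "spearman"` may be a reasonable alternative to the Pearson default.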
options(knitr.kable.NA = '')
NUM=data.frame(TRAIN[2:36])

# correlations/covariance
kable(cov(NUM))
CHILD_ETHNICITY CHILD_AGE CHILD_GENDER CHILD_SERBIAN_LANGUAGE MOTHER_AGE MARITAL_STATUS MOTHER_ETHNICITY MOTHER_SERBIAN_LANGUAGE NUMBER_OF_CHILDREN BIRTH_ORDER MOTHER_EDUCATION_LEVEL MOTHER_EMPLOYMENT_STATUS QUALITY_OF_HOUSING HOUSING_CONDITIONS HOUSEHOLD_MONTHLY_INCOME BIRTH_WEIGHT BREASTFEEDING BREASTFEEDING_FREQUENCY BREASTFEEDING_DURING_NIGHT BOTTLE_FEEDING INFANT_FORMULAS ADDITIONAL_FOOD_SWEETENING CHILD_FLUORIDE_SUPPLEMENTS CHILD_FLUORIDE_TOOTHPASTE CHILD_ORAL_HYGIENE CHILD_TOOTH_BRUSHING DIARRHEA_DURING_INFANCY MEDICAL_SYRUPS CHILD_FIRST_DENTIST_VISIT SWEETS_DURING_PREGNANCY FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY ORAL_HEALTH_DURING_PREGNANCY MOTHER_HEALTH_AWARENESS FATHER_HEALTH_AWARENESS ECC
CHILD_ETHNICITY 2.1987624 -0.2024718 0.0339826 0.1581695 -0.2397947 0.0481523 52.6828346 -0.1776836 0.2899863 0.3062480 -0.5056081 -0.4834921 0.1135509 0.1806019 -0.7306354 -0.0727647 0.2857143 4.749161e+01 1.3431314 0.0333146 -0.0403115 0.0467107 0.0702331 0.2147076 0.1758553 0.2232868 -0.1511902 0.0760346 0.2295805 -0.0888330 0.0830315 0.3362751 -0.2241307 -0.3318449 -0.1265427
CHILD_AGE -0.2024718 0.6091558 0.0266517 -0.0300447 0.1040751 0.0109525 9.6080834 0.0138708 -0.0079287 -0.0336662 0.0703386 0.0788650 -0.0434056 -0.0256848 0.1480433 0.0119897 -0.0672269 -1.105976e+01 -7.7784009 -0.0303787 0.0440737 -0.0134841 -0.0416828 -0.0391688 -0.0178088 -0.1661334 0.0130797 -0.0340354 -0.1005415 0.0190746 -0.0309237 -0.0452692 0.0427903 0.0415597 0.0344925
CHILD_GENDER 0.0339826 0.0266517 0.2503077 -0.0131500 -0.0639745 0.0086143 -5.6851728 -0.0325235 0.0056608 -0.0151014 -0.0207799 -0.0051510 0.0421047 0.0111459 0.0073837 -0.0025140 -0.0336134 -1.356791e+01 -14.2560740 0.0368658 0.0007208 0.0060124 -0.0164024 0.0087550 -0.0012130 -0.0485215 0.0056608 -0.0199712 -0.0208678 0.0233114 0.0558876 -0.0096867 -0.0110052 0.0091769 0.0444077
CHILD_SERBIAN_LANGUAGE 0.1581695 -0.0300447 -0.0131500 0.1164516 -0.0447769 0.0057487 -2.6184382 -0.0228192 0.0555184 0.0559228 -0.0725537 -0.0615836 0.0160508 0.0159101 -0.0887100 -0.0128336 0.0000000 -3.126877e+00 -3.9717134 0.0014416 -0.0087198 0.0188812 0.0099680 0.0137829 0.0247178 0.0223269 -0.0243135 -0.0148026 0.0626560 0.0154882 0.0008790 0.0354066 -0.0246827 -0.0167364 -0.0272846
MOTHER_AGE -0.2397947 0.1040751 -0.0639745 -0.0447769 0.5229774 -0.0517914 3.3317746 0.0601420 0.0304314 0.0708484 0.1728842 0.1742379 -0.1245209 -0.0724482 0.2713336 0.0562568 -0.0378151 1.662213e-01 6.7569178 -0.0064344 -0.0067860 0.0533561 -0.0467459 -0.0905207 -0.0782497 -0.0585598 0.0514398 0.0121655 -0.0723427 0.0333497 -0.0880595 -0.1114061 0.0967441 0.0790057 0.0096691
MARITAL_STATUS 0.0481523 0.0109525 0.0086143 0.0057487 -0.0517914 0.3084280 3.4160015 -0.0535143 -0.0305721 -0.0194789 -0.0482226 -0.0349847 0.0276713 0.0347737 -0.0789705 0.0052389 0.0294118 1.392198e+01 7.3551387 -0.0223797 -0.0050280 -0.0081221 0.0113217 0.0434584 0.0710770 0.0951795 -0.0221687 -0.0675961 0.0291481 -0.0111459 0.0564502 0.0018811 -0.0644492 -0.0329630 -0.0212897
MOTHER_ETHNICITY 52.6828346 9.6080834 -5.6851728 -2.6184382 3.3317746 3.4160015 20445.5258606 1.2410780 2.3230723 2.7782954 -0.6589255 -1.0085616 -0.9502655 -1.9113076 4.5684575 1.8406174 4.4117647 1.753670e+03 -1927.9782532 -0.5337717 -3.4669667 2.2320945 1.9909637 0.2367533 -1.4686896 -2.8484231 -2.2147428 -0.3912837 1.0518442 -5.4078795 -4.8317218 0.2974052 2.7656728 -10.2085546 -6.4421785
MOTHER_SERBIAN_LANGUAGE -0.1776836 0.0138708 -0.0325235 -0.0228192 0.0601420 -0.0535143 1.2410780 0.1968988 -0.0732218 -0.0446187 0.0694772 0.0937555 -0.0363032 -0.0402236 0.1101930 0.0172638 -0.0084034 6.266358e+00 3.7417461 -0.0196899 0.0216413 -0.0251573 -0.0241377 -0.0611793 -0.0536374 -0.0866707 0.0234169 -0.0040083 -0.0412784 0.0068387 -0.0311698 -0.0414015 0.0409620 0.0166661 -0.0000527
NUMBER_OF_CHILDREN 0.2899863 -0.0079287 0.0056608 0.0555184 0.0304314 -0.0305721 2.3230723 -0.0732218 0.5697057 0.4759151 -0.1504167 -0.2174677 0.0604409 0.0689498 -0.2885095 -0.0260891 -0.0420168 -9.124380e+00 -11.7260469 0.0304314 0.0065399 -0.0036567 0.0166837 0.0064695 0.0591927 0.0483809 -0.0521430 0.0005977 0.1714954 -0.0525825 -0.0069618 0.0973946 -0.1159418 -0.0798847 -0.0383601
BIRTH_ORDER 0.3062480 -0.0336662 -0.0151014 0.0559228 0.0708484 -0.0194789 2.7782954 -0.0446187 0.4759151 0.6058507 -0.1527548 -0.2064625 0.0437045 0.0661018 -0.2657959 0.0038325 -0.0042017 -4.066946e+00 -8.0309061 0.0162266 0.0020745 0.0204810 0.0102844 0.0025491 0.0783904 0.0893956 -0.0408917 -0.0313456 0.1186667 -0.0347737 -0.0244366 0.1156957 -0.1070989 -0.0826272 -0.0666995
MOTHER_EDUCATION_LEVEL -0.5056081 0.0703386 -0.0207799 -0.0725537 0.1728842 -0.0482226 -0.6589255 0.0694772 -0.1504167 -0.1527548 0.4537112 0.2989346 -0.1247846 -0.1016842 0.4634682 0.0427903 -0.0546218 -9.432562e+00 -0.7563728 0.0342288 0.0120601 0.0563271 -0.0521606 -0.1083823 -0.0998207 -0.1188777 0.0890791 -0.0287613 -0.1399916 0.0642558 -0.0750501 -0.1705812 0.1759783 0.1607187 0.0823283
MOTHER_EMPLOYMENT_STATUS -0.4834921 0.0788650 -0.0051510 -0.0615836 0.1742379 -0.0349847 -1.0085616 0.0937555 -0.2174677 -0.2064625 0.2989346 0.7414648 -0.1287226 -0.1102634 0.5192328 0.0646602 0.0126050 8.527566e+00 15.1098590 0.0439858 0.0268275 0.0953729 -0.0593509 -0.1031258 -0.0740480 -0.0837699 0.0766499 0.0037622 -0.1521747 0.0375514 -0.0418410 -0.1366162 0.1765761 0.1630393 0.0852994
QUALITY_OF_HOUSING 0.1135509 -0.0434056 0.0421047 0.0160508 -0.1245209 0.0276713 -0.9502655 -0.0363032 0.0604409 0.0437045 -0.1247846 -0.1287226 0.8175521 0.0488028 -0.2037727 -0.0265286 0.0546218 -7.444974e+00 -11.2382300 -0.0488907 -0.0111986 0.0239267 0.0691431 0.0349144 0.0110580 0.0431595 -0.0277944 0.0303084 0.0198481 0.0324707 0.1282128 0.0914701 -0.0793748 -0.0778102 -0.0365318
HOUSING_CONDITIONS 0.1806019 -0.0256848 0.0111459 0.0159101 -0.0724482 0.0347737 -1.9113076 -0.0402236 0.0689498 0.0661018 -0.1016842 -0.1102634 0.0488028 0.0907141 -0.1358602 -0.0159277 0.0420168 8.008509e-01 -5.0638691 0.0073837 -0.0065399 0.0120601 0.0253331 0.0649590 0.0626560 0.0692662 -0.0444956 0.0036040 0.0427903 -0.0272494 0.0531803 0.0916810 -0.0773355 -0.0671741 -0.0120601
HOUSEHOLD_MONTHLY_INCOME -0.7306354 0.1480433 0.0073837 -0.0887100 0.2713336 -0.0789705 4.5684575 0.1101930 -0.2885095 -0.2657959 0.4634682 0.5192328 -0.2037727 -0.1358602 1.1940157 0.0993284 -0.0588235 -2.405427e+01 -6.8810696 0.0738546 0.0509124 0.1400970 -0.0323125 -0.1596287 -0.0879364 -0.1309026 0.1442636 -0.0381316 -0.2445238 0.0846841 -0.0830667 -0.2072712 0.2484793 0.2118421 0.0951971
BIRTH_WEIGHT -0.0727647 0.0119897 -0.0025140 -0.0128336 0.0562568 0.0052389 1.8406174 0.0172638 -0.0260891 0.0038325 0.0427903 0.0646602 -0.0265286 -0.0159277 0.0993284 0.0839281 0.0000000 -1.783833e+00 0.1058155 0.0058366 0.0059949 0.0022503 0.0065399 -0.0094758 -0.0112162 0.0177385 0.0201294 -0.0113568 0.0020921 0.0158750 -0.0063816 -0.0311698 0.0180198 0.0178088 0.0271615
BREASTFEEDING 0.2857143 -0.0672269 -0.0336134 0.0000000 -0.0378151 0.0294118 4.4117647 -0.0084034 -0.0420168 -0.0042017 -0.0546218 0.0126050 0.0546218 0.0420168 -0.0588235 0.0000000 1.0588235 2.220378e+02 184.4873950 -0.1722689 0.0210084 0.0672269 0.0294118 0.0714286 0.0000000 0.1470588 -0.0420168 -0.0042017 -0.0126050 -0.0126050 0.0714286 0.1176471 -0.0588235 -0.0630252 -0.0252101
BREASTFEEDING_FREQUENCY 47.4916142 -11.0597553 -13.5679125 -3.1268767 0.1662213 13.9219788 1753.6700538 6.2663584 -9.1243803 -4.0669456 -9.4325621 8.5275658 -7.4449738 0.8008509 -24.0542702 -1.7838332 222.0378151 1.032018e+05 81188.4077740 -158.8968039 -41.0781970 2.8436236 -7.5073837 20.4049787 -15.0492775 24.7810028 -0.8092542 0.7039309 27.4577898 -12.1717591 -4.3966984 11.0369537 -15.2968426 -2.0703913 -2.9024472
BREASTFEEDING_DURING_NIGHT 1.3431314 -7.7784009 -14.2560740 -3.9717134 6.7569178 7.3551387 -1927.9782532 3.7417461 -11.7260469 -8.0309061 -0.7563728 15.1098590 -11.2382300 -5.0638691 -6.8810696 0.1058155 184.4873950 8.118841e+04 83566.6301818 -127.4657712 -31.1560072 1.9344784 -2.3336732 9.5008614 -18.1670476 20.0365845 0.8831968 7.1224992 10.4734538 0.9171970 -6.2053022 -2.4352871 -9.5845259 3.2027355 -1.9302767
BOTTLE_FEEDING 0.0333146 -0.0303787 0.0368658 0.0014416 -0.0064344 -0.0223797 -0.5337717 -0.0196899 0.0304314 0.0162266 0.0342288 0.0439858 -0.0488907 0.0073837 0.0738546 0.0058366 -0.1722689 -1.588968e+02 -127.4657712 0.8254984 0.1780880 0.0575578 0.0162793 -0.0485039 0.0646074 -0.0249464 0.0136247 -0.0256496 -0.0135192 0.0165430 0.0169825 -0.0105657 0.0085088 -0.0302380 -0.0365493
INFANT_FORMULAS -0.0403115 0.0440737 0.0007208 -0.0087198 -0.0067860 -0.0050280 -3.4669667 0.0216413 0.0065399 0.0020745 0.0120601 0.0268275 -0.0111986 -0.0065399 0.0509124 0.0059949 0.0210084 -4.107820e+01 -31.1560072 0.1780880 0.2468268 -0.0088429 0.0484863 -0.0069794 0.0184065 0.0051686 0.0233466 -0.0338244 0.0022503 0.0074364 0.0412609 -0.0121304 -0.0164200 -0.0170353 0.0130445
ADDITIONAL_FOOD_SWEETENING 0.0467107 -0.0134841 0.0060124 0.0188812 0.0533561 -0.0081221 2.2320945 -0.0251573 -0.0036567 0.0204810 0.0563271 0.0953729 0.0239267 0.0120601 0.1400970 0.0022503 0.0672269 2.843624e+00 1.9344784 0.0575578 -0.0088429 0.4954116 0.0159453 -0.0219402 0.0025843 -0.0142752 0.0173517 -0.0252277 -0.0642382 0.0310819 -0.0025140 -0.0451285 0.0665588 0.0332443 0.0340002
CHILD_FLUORIDE_SUPPLEMENTS 0.0702331 -0.0416828 -0.0164024 0.0099680 -0.0467459 0.0113217 1.9909637 -0.0241377 0.0166837 0.0102844 -0.0521606 -0.0593509 0.0691431 0.0253331 -0.0323125 0.0065399 0.0294118 -7.507384e+00 -2.3336732 0.0162793 0.0484863 0.0159453 0.2752013 0.0622868 0.0399423 0.0052565 -0.0043247 0.0150487 0.1223937 -0.0262649 0.1015435 0.0585774 -0.0878134 -0.0369185 -0.0075419
CHILD_FLUORIDE_TOOTHPASTE 0.2147076 -0.0391688 0.0087550 0.0137829 -0.0905207 0.0434584 0.2367533 -0.0611793 0.0064695 0.0025491 -0.1083823 -0.1031258 0.0349144 0.0649590 -0.1596287 -0.0094758 0.0714286 2.040498e+01 9.5008614 -0.0485039 -0.0069794 -0.0219402 0.0622868 0.4841954 0.0316269 0.1054112 -0.0355473 0.0119897 0.0457790 -0.0255793 0.0926831 0.0801660 -0.0948103 -0.0633417 -0.0116733
CHILD_ORAL_HYGIENE 0.1758553 -0.0178088 -0.0012130 0.0247178 -0.0782497 0.0710770 -1.4686896 -0.0536374 0.0591927 0.0783904 -0.0998207 -0.0740480 0.0110580 0.0626560 -0.0879364 -0.0112162 0.0000000 -1.504928e+01 -18.1670476 0.0646074 0.0184065 0.0025843 0.0399423 0.0316269 0.2583243 0.1817095 -0.0374459 -0.0157343 0.0823986 -0.0388524 0.0735206 0.1057804 -0.0895011 -0.0573116 -0.0319961
CHILD_TOOTH_BRUSHING 0.2232868 -0.1661334 -0.0485215 0.0223269 -0.0585598 0.0951795 -2.8484231 -0.0866707 0.0483809 0.0893956 -0.1188777 -0.0837699 0.0431595 0.0692662 -0.1309026 0.0177385 0.1470588 2.478100e+01 20.0365845 -0.0249464 0.0051686 -0.0142752 0.0052565 0.1054112 0.1817095 0.7557751 -0.0608628 0.0428958 0.1698956 -0.0288844 0.0710770 0.0841567 -0.0800429 -0.0655743 -0.0319433
DIARRHEA_DURING_INFANCY -0.1511902 0.0130797 0.0056608 -0.0243135 0.0514398 -0.0221687 -2.2147428 0.0234169 -0.0521430 -0.0408917 0.0890791 0.0766499 -0.0277944 -0.0444956 0.1442636 0.0201294 -0.0420168 -8.092542e-01 0.8831968 0.0136247 0.0233466 0.0173517 -0.0043247 -0.0355473 -0.0374459 -0.0608628 0.0907141 -0.0204107 -0.0680004 0.0188460 -0.0321719 -0.0832777 0.0479238 0.0461657 0.0162617
MEDICAL_SYRUPS 0.0760346 -0.0340354 -0.0199712 -0.0148026 0.0121655 -0.0675961 -0.3912837 -0.0040083 0.0005977 -0.0313456 -0.0287613 0.0037622 0.0303084 0.0036040 -0.0381316 -0.0113568 -0.0042017 7.039309e-01 7.1224992 -0.0256496 -0.0338244 -0.0252277 0.0150487 0.0119897 -0.0157343 0.0428958 -0.0204107 0.2618403 -0.0044478 -0.0176857 -0.0022151 0.0138005 -0.0038501 -0.0517738 -0.0335959
CHILD_FIRST_DENTIST_VISIT 0.2295805 -0.1005415 -0.0208678 0.0626560 -0.0723427 0.0291481 1.0518442 -0.0412784 0.1714954 0.1186667 -0.1399916 -0.1521747 0.0198481 0.0427903 -0.2445238 0.0020921 -0.0126050 2.745779e+01 10.4734538 -0.0135192 0.0022503 -0.0642382 0.1223937 0.0457790 0.0823986 0.1698956 -0.0680004 -0.0044478 0.8653704 -0.0690552 0.0110228 0.0992933 -0.1185964 -0.0850005 0.0096164
SWEETS_DURING_PREGNANCY -0.0888330 0.0190746 0.0233114 0.0154882 0.0333497 -0.0111459 -5.4078795 0.0068387 -0.0525825 -0.0347737 0.0642558 0.0375514 0.0324707 -0.0272494 0.0846841 0.0158750 -0.0126050 -1.217176e+01 0.9171970 0.0165430 0.0074364 0.0310819 -0.0262649 -0.0255793 -0.0388524 -0.0288844 0.0188460 -0.0176857 -0.0690552 0.2179600 -0.0248585 -0.0337365 0.0548328 0.0403643 0.0067332
FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY 0.0830315 -0.0309237 0.0558876 0.0008790 -0.0880595 0.0564502 -4.8317218 -0.0311698 -0.0069618 -0.0244366 -0.0750501 -0.0418410 0.1282128 0.0531803 -0.0830667 -0.0063816 0.0714286 -4.396698e+00 -6.2053022 0.0169825 0.0412609 -0.0025140 0.1015435 0.0926831 0.0735206 0.0710770 -0.0321719 -0.0022151 0.0110228 -0.0248585 0.6160121 0.0617067 -0.1429978 -0.0801308 -0.0058894
ORAL_HEALTH_DURING_PREGNANCY 0.3362751 -0.0452692 -0.0096867 0.0354066 -0.1114061 0.0018811 0.2974052 -0.0414015 0.0973946 0.1156957 -0.1705812 -0.1366162 0.0914701 0.0916810 -0.2072712 -0.0311698 0.1176471 1.103695e+01 -2.4352871 -0.0105657 -0.0121304 -0.0451285 0.0585774 0.0801660 0.1057804 0.0841567 -0.0832777 0.0138005 0.0992933 -0.0337365 0.0617067 0.5309237 -0.1268415 -0.1219542 -0.0431068
MOTHER_HEALTH_AWARENESS -0.2241307 0.0427903 -0.0110052 -0.0246827 0.0967441 -0.0644492 2.7656728 0.0409620 -0.1159418 -0.1070989 0.1759783 0.1765761 -0.0793748 -0.0773355 0.2484793 0.0180198 -0.0588235 -1.529684e+01 -9.5845259 0.0085088 -0.0164200 0.0665588 -0.0878134 -0.0948103 -0.0895011 -0.0800429 0.0479238 -0.0038501 -0.1185964 0.0548328 -0.1429978 -0.1268415 0.2486551 0.1208291 0.0468865
FATHER_HEALTH_AWARENESS -0.3318449 0.0415597 0.0091769 -0.0167364 0.0790057 -0.0329630 -10.2085546 0.0166661 -0.0798847 -0.0826272 0.1607187 0.1630393 -0.0778102 -0.0671741 0.2118421 0.0178088 -0.0630252 -2.070391e+00 3.2027355 -0.0302380 -0.0170353 0.0332443 -0.0369185 -0.0633417 -0.0573116 -0.0655743 0.0461657 -0.0517738 -0.0850005 0.0403643 -0.0801308 -0.1219542 0.1208291 0.3035055 0.0591927
ECC -0.1265427 0.0344925 0.0444077 -0.0272846 0.0096691 -0.0212897 -6.4421785 -0.0000527 -0.0383601 -0.0666995 0.0823283 0.0852994 -0.0365318 -0.0120601 0.0951971 0.0271615 -0.0252101 -2.902447e+00 -1.9302767 -0.0365493 0.0130445 0.0340002 -0.0075419 -0.0116733 -0.0319961 -0.0319433 0.0162617 -0.0335959 0.0096164 0.0067332 -0.0058894 -0.0431068 0.0468865 0.0591927 0.2096973
kable(cor(NUM))
CHILD_ETHNICITY CHILD_AGE CHILD_GENDER CHILD_SERBIAN_LANGUAGE MOTHER_AGE MARITAL_STATUS MOTHER_ETHNICITY MOTHER_SERBIAN_LANGUAGE NUMBER_OF_CHILDREN BIRTH_ORDER MOTHER_EDUCATION_LEVEL MOTHER_EMPLOYMENT_STATUS QUALITY_OF_HOUSING HOUSING_CONDITIONS HOUSEHOLD_MONTHLY_INCOME BIRTH_WEIGHT BREASTFEEDING BREASTFEEDING_FREQUENCY BREASTFEEDING_DURING_NIGHT BOTTLE_FEEDING INFANT_FORMULAS ADDITIONAL_FOOD_SWEETENING CHILD_FLUORIDE_SUPPLEMENTS CHILD_FLUORIDE_TOOTHPASTE CHILD_ORAL_HYGIENE CHILD_TOOTH_BRUSHING DIARRHEA_DURING_INFANCY MEDICAL_SYRUPS CHILD_FIRST_DENTIST_VISIT SWEETS_DURING_PREGNANCY FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY ORAL_HEALTH_DURING_PREGNANCY MOTHER_HEALTH_AWARENESS FATHER_HEALTH_AWARENESS ECC
CHILD_ETHNICITY 1.0000000 -0.1749489 0.0458069 0.3125799 -0.2236191 0.0584724 0.2484739 -0.2700453 0.2590974 0.2653392 -0.5062151 -0.3786649 0.0846922 0.4043858 -0.4509273 -0.1693861 0.1872540 0.0996975 0.0031334 0.0247279 -0.0547197 0.0447553 0.0902875 0.2080885 0.2333370 0.1732119 -0.3385299 0.1002083 0.1664351 -0.1283208 0.0713443 0.3112358 -0.3031192 -0.4062213 -0.1863595
CHILD_AGE -0.1749489 1.0000000 0.0682532 -0.1128055 0.1843916 0.0252681 0.0860941 0.0400513 -0.0134590 -0.0554175 0.1337950 0.1173478 -0.0615070 -0.1092632 0.1735880 0.0530263 -0.0837080 -0.0441101 -0.0344754 -0.0428397 0.1136630 -0.0245456 -0.1018046 -0.0721217 -0.0448939 -0.2448479 0.0556412 -0.0852213 -0.1384778 0.0523483 -0.0504815 -0.0796017 0.1099469 0.0966552 0.0965081
CHILD_GENDER 0.0458069 0.0682532 1.0000000 -0.0770224 -0.1768189 0.0310033 -0.0794708 -0.1465002 0.0149906 -0.0387792 -0.0616617 -0.0119567 0.0930756 0.0739673 0.0135062 -0.0173448 -0.0652926 -0.0844175 -0.0985704 0.0811014 0.0028999 0.0170738 -0.0624949 0.0251482 -0.0047704 -0.1115580 0.0375670 -0.0780096 -0.0448371 0.0998029 0.1423259 -0.0265720 -0.0441127 0.0332947 0.1938318
CHILD_SERBIAN_LANGUAGE 0.3125799 -0.1128055 -0.0770224 1.0000000 -0.1814429 0.0303336 -0.0536624 -0.1506973 0.2155456 0.2105393 -0.3156437 -0.2095788 0.0520194 0.1547974 -0.2379001 -0.1298140 0.0000000 -0.0285229 -0.0402614 0.0046495 -0.0514325 0.0786092 0.0556814 0.0580441 0.1425132 0.0752592 -0.2365577 -0.0847708 0.1973736 0.0972165 0.0032819 0.1423954 -0.1450510 -0.0890238 -0.1746014
MOTHER_AGE -0.2236191 0.1843916 -0.1768189 -0.1814429 1.0000000 -0.1289554 0.0322207 0.1874197 0.0557514 0.1258653 0.3549148 0.2798053 -0.1904334 -0.3326204 0.3433659 0.2685220 -0.0508174 0.0007155 0.0323214 -0.0097928 -0.0188875 0.1048237 -0.1232187 -0.1798855 -0.2128917 -0.0931455 0.2361677 0.0328754 -0.1075357 0.0987785 -0.1551458 -0.2114226 0.2682777 0.1983049 0.0291978
MARITAL_STATUS 0.0584724 0.0252681 0.0310033 0.0303336 -0.1289554 1.0000000 0.0430172 -0.2171557 -0.0729327 -0.0450615 -0.1289093 -0.0731570 0.0551055 0.2078917 -0.1301317 0.0325620 0.0514674 0.0780334 0.0458139 -0.0443525 -0.0182229 -0.0207782 0.0388605 0.1124570 0.2518079 0.1971379 -0.1325336 -0.2378627 0.0564198 -0.0429882 0.1295072 0.0046485 -0.2327245 -0.1077373 -0.0837136
MOTHER_ETHNICITY 0.2484739 0.0860941 -0.0794708 -0.0536624 0.0322207 0.0430172 1.0000000 0.0195604 0.0215248 0.0249630 -0.0068414 -0.0081914 -0.0073500 -0.0443807 0.0292392 0.0444335 0.0299848 0.0381773 -0.0466430 -0.0041086 -0.0488039 0.0221784 0.0265423 0.0023795 -0.0202092 -0.0229144 -0.0514265 -0.0053478 0.0079077 -0.0810102 -0.0430535 0.0028545 0.0387885 -0.1295931 -0.0983869
MOTHER_SERBIAN_LANGUAGE -0.2700453 0.0400513 -0.1465002 -0.1506973 0.1874197 -0.2171557 0.0195604 1.0000000 -0.2186217 -0.1291851 0.2324506 0.2453748 -0.0904828 -0.3009693 0.2272624 0.1342954 -0.0184043 0.0439592 0.0291700 -0.0488386 0.0981670 -0.0805490 -0.1036929 -0.1981402 -0.2378281 -0.2246747 0.1752146 -0.0176531 -0.1000001 0.0330115 -0.0894989 -0.1280497 0.1851232 0.0681755 -0.0002596
NUMBER_OF_CHILDREN 0.2590974 -0.0134590 0.0149906 0.2155456 0.0557514 -0.0729327 0.0215248 -0.2186217 1.0000000 0.8100678 -0.2958563 -0.3345987 0.0885621 0.3032983 -0.3498081 -0.1193109 -0.0540986 -0.0376300 -0.0537415 0.0443750 0.0174400 -0.0068830 0.0421348 0.0123179 0.1542980 0.0737313 -0.2293684 0.0015476 0.2442452 -0.1492203 -0.0117517 0.1770898 -0.3080463 -0.1921122 -0.1109834
BIRTH_ORDER 0.2653392 -0.0554175 -0.0387792 0.2105393 0.1258653 -0.0450615 0.0249630 -0.1291851 0.8100678 1.0000000 -0.2913549 -0.3080443 0.0620992 0.2819634 -0.3125075 0.0169959 -0.0052460 -0.0162645 -0.0356915 0.0229449 0.0053645 0.0373840 0.0251868 0.0047065 0.1981514 0.1321104 -0.1744274 -0.0787001 0.1638872 -0.0956930 -0.0400002 0.2039944 -0.2759329 -0.1926890 -0.1871299
MOTHER_EDUCATION_LEVEL -0.5062151 0.1337950 -0.0616617 -0.3156437 0.3549148 -0.1289093 -0.0068414 0.2324506 -0.2958563 -0.2913549 1.0000000 0.5153962 -0.2048867 -0.5012175 0.6296877 0.2192816 -0.0788070 -0.0435909 -0.0038845 0.0559298 0.0360382 0.1188078 -0.1476141 -0.2312374 -0.2915736 -0.2030085 0.4390853 -0.0834450 -0.2234144 0.2043311 -0.1419603 -0.3475565 0.5239270 0.4331051 0.2669090
MOTHER_EMPLOYMENT_STATUS -0.3786649 0.1173478 -0.0119567 -0.2095788 0.2798053 -0.0731570 -0.0081914 0.2453748 -0.3345987 -0.3080443 0.5153962 1.0000000 -0.1653301 -0.4251562 0.5518383 0.2592017 0.0142261 0.0308273 0.0607014 0.0562224 0.0627102 0.1573608 -0.1313884 -0.1721122 -0.1691943 -0.1119042 0.2955486 0.0085384 -0.1899748 0.0934099 -0.0619102 -0.2177413 0.4112329 0.3436875 0.2163238
QUALITY_OF_HOUSING 0.0846922 -0.0615070 0.0930756 0.0520194 -0.1904334 0.0551055 -0.0073500 -0.0904828 0.0885621 0.0620992 -0.2048867 -0.1653301 1.0000000 0.1792047 -0.2062449 -0.1012751 0.0587079 -0.0256308 -0.0429956 -0.0595128 -0.0249293 0.0375961 0.1457693 0.0554928 0.0240622 0.0549064 -0.1020615 0.0655068 0.0235972 0.0769212 0.1806671 0.1388370 -0.1760461 -0.1562052 -0.0882301
HOUSING_CONDITIONS 0.4043858 -0.1092632 0.0739673 0.1547974 -0.3326204 0.2078917 -0.0443807 -0.3009693 0.3032983 0.2819634 -0.5012175 -0.4251562 0.1792047 1.0000000 -0.4128096 -0.1825417 0.1355732 0.0082770 -0.0581606 0.0269823 -0.0437053 0.0568891 0.1603343 0.3099502 0.4093010 0.2645377 -0.4905039 0.0233842 0.1527240 -0.1937899 0.2249668 0.4177592 -0.5149238 -0.4048382 -0.0874411
HOUSEHOLD_MONTHLY_INCOME -0.4509273 0.1735880 0.0135062 -0.2379001 0.3433659 -0.1301317 0.0292392 0.2272624 -0.3498081 -0.3125075 0.6296877 0.5518383 -0.2062449 -0.4128096 1.0000000 0.3137724 -0.0523160 -0.0685241 -0.0217838 0.0743900 0.0937827 0.1821549 -0.0563690 -0.2099402 -0.1583366 -0.1377993 0.4383431 -0.0681964 -0.2405553 0.1660001 -0.0968562 -0.2603262 0.4560228 0.3519037 0.1902489
BIRTH_WEIGHT -0.1693861 0.0530263 -0.0173448 -0.1298140 0.2685220 0.0325620 0.0444335 0.1342954 -0.1193109 0.0169959 0.2192816 0.2592017 -0.1012751 -0.1825417 0.3137724 1.0000000 0.0000000 -0.0191671 0.0012635 0.0221744 0.0416514 0.0110357 0.0430318 -0.0470056 -0.0761745 0.0704314 0.2306957 -0.0766100 0.0077628 0.1173737 -0.0280662 -0.1476604 0.1247374 0.1115829 0.2047404
BREASTFEEDING 0.1872540 -0.0837080 -0.0652926 0.0000000 -0.0508174 0.0514674 0.0299848 -0.0184043 -0.0540986 -0.0052460 -0.0788070 0.0142261 0.0587079 0.1355732 -0.0523160 0.0000000 1.0000000 0.6716940 0.6202096 -0.1842625 0.0410946 0.0928214 0.0544859 0.0997585 0.0000000 0.1643929 -0.1355732 -0.0079798 -0.0131684 -0.0262388 0.0884434 0.1569109 -0.1146412 -0.1111781 -0.0535015
BREASTFEEDING_FREQUENCY 0.0996975 -0.0441101 -0.0844175 -0.0285229 0.0007155 0.0780334 0.0381773 0.0439592 -0.0376300 -0.0162645 -0.0435909 0.0308273 -0.0256308 0.0082770 -0.0685241 -0.0191671 0.6716940 1.0000000 0.8742465 -0.5443940 -0.2573781 0.0125761 -0.0445471 0.0912814 -0.0921700 0.0887317 -0.0083638 0.0042822 0.0918800 -0.0811561 -0.0174377 0.0471508 -0.0954903 -0.0116984 -0.0197299
BREASTFEEDING_DURING_NIGHT 0.0031334 -0.0344754 -0.0985704 -0.0402614 0.0323214 0.0458139 -0.0466430 0.0291700 -0.0537415 -0.0356915 -0.0038845 0.0607014 -0.0429956 -0.0581606 -0.0217838 0.0012635 0.6202096 0.8742465 1.0000000 -0.4853097 -0.2169348 0.0095075 -0.0153886 0.0472320 -0.1236475 0.0797280 0.0101439 0.0481502 0.0389469 0.0067961 -0.0273497 -0.0115616 -0.0664899 0.0201104 -0.0145817
BOTTLE_FEEDING 0.0247279 -0.0428397 0.0811014 0.0046495 -0.0097928 -0.0443525 -0.0041086 -0.0488386 0.0443750 0.0229449 0.0559298 0.0562224 -0.0595128 0.0269823 0.0743900 0.0221744 -0.1842625 -0.5443940 -0.4853097 1.0000000 0.3945303 0.0900042 0.0341549 -0.0767200 0.1399078 -0.0315830 0.0497888 -0.0551701 -0.0159953 0.0390003 0.0238149 -0.0159597 0.0187808 -0.0604105 -0.0878466
INFANT_FORMULAS -0.0547197 0.1136630 0.0028999 -0.0514325 -0.0188875 -0.0182229 -0.0488039 0.0981670 0.0174400 0.0053645 0.0360382 0.0627102 -0.0249293 -0.0437053 0.0937827 0.0416514 0.0410946 -0.2573781 -0.2169348 0.3945303 1.0000000 -0.0252880 0.1860364 -0.0201887 0.0728942 0.0119669 0.1560234 -0.1330503 0.0048690 0.0320613 0.1058151 -0.0335090 -0.0662792 -0.0622400 0.0573372
ADDITIONAL_FOOD_SWEETENING 0.0447553 -0.0245456 0.0170738 0.0786092 0.1048237 -0.0207782 0.0221784 -0.0805490 -0.0068830 0.0373840 0.1188078 0.1573608 0.0375961 0.0568891 0.1821549 0.0110357 0.0928214 0.0125761 0.0095075 0.0900042 -0.0252880 1.0000000 0.0431841 -0.0447967 0.0072240 -0.0233293 0.0818506 -0.0700448 -0.0981092 0.0945880 -0.0045508 -0.0879938 0.1896374 0.0857335 0.1054878
CHILD_FLUORIDE_SUPPLEMENTS 0.0902875 -0.1018046 -0.0624949 0.0556814 -0.1232187 0.0388605 0.0265423 -0.1036929 0.0421348 0.0251868 -0.1476141 -0.1313884 0.1457693 0.1603343 -0.0563690 0.0430318 0.0544859 -0.0445471 -0.0153886 0.0341549 0.1860364 0.0431841 1.0000000 0.1706321 0.1498048 0.0115259 -0.0273714 0.0560603 0.2508031 -0.1072413 0.2466224 0.1532459 -0.3356887 -0.1277426 -0.0313950
CHILD_FLUORIDE_TOOTHPASTE 0.2080885 -0.0721217 0.0251482 0.0580441 -0.1798855 0.1124570 0.0023795 -0.1981402 0.0123179 0.0047065 -0.2312374 -0.1721122 0.0554928 0.3099502 -0.2099402 -0.0470056 0.0997585 0.0912814 0.0472320 -0.0767200 -0.0201887 -0.0447967 0.1706321 1.0000000 0.0894259 0.1742530 -0.1696128 0.0336729 0.0707220 -0.0787389 0.1697054 0.1581116 -0.2732414 -0.1652326 -0.0366342
CHILD_ORAL_HYGIENE 0.2333370 -0.0448939 -0.0047704 0.1425132 -0.2128917 0.2518079 -0.0202092 -0.2378281 0.1542980 0.1981514 -0.2915736 -0.1691943 0.0240622 0.4093010 -0.1583366 -0.0761745 0.0000000 -0.0921700 -0.1236475 0.1399078 0.0728942 0.0072240 0.1498048 0.0894259 1.0000000 0.4112432 -0.2446159 -0.0604989 0.1742756 -0.1637369 0.1843028 0.2856318 -0.3531400 -0.2046807 -0.1374730
CHILD_TOOTH_BRUSHING 0.1732119 -0.2448479 -0.1115580 0.0752592 -0.0931455 0.1971379 -0.0229144 -0.2246747 0.0737313 0.1321104 -0.2030085 -0.1119042 0.0549064 0.2645377 -0.1377993 0.0704314 0.1643929 0.0887317 0.0797280 -0.0315830 0.0119669 -0.0233293 0.0115259 0.1742530 0.4112432 1.0000000 -0.2324441 0.0964274 0.2100800 -0.0711669 0.1041689 0.1328545 -0.1846409 -0.1369161 -0.0802393
DIARRHEA_DURING_INFANCY -0.3385299 0.0556412 0.0375670 -0.2365577 0.2361677 -0.1325336 -0.0514265 0.1752146 -0.2293684 -0.1744274 0.4390853 0.2955486 -0.1020615 -0.4905039 0.4383431 0.2306957 -0.1355732 -0.0083638 0.0101439 0.0497888 0.1560234 0.0818506 -0.0273714 -0.1696128 -0.2446159 -0.2324441 1.0000000 -0.1324347 -0.2427019 0.1340276 -0.1360956 -0.3794679 0.3190912 0.2782269 0.1179052
MEDICAL_SYRUPS 0.1002083 -0.0852213 -0.0780096 -0.0847708 0.0328754 -0.2378627 -0.0053478 -0.0176531 0.0015476 -0.0787001 -0.0834450 0.0085384 0.0655068 0.0233842 -0.0681964 -0.0766100 -0.0079798 0.0042822 0.0481502 -0.0551701 -0.1330503 -0.0700448 0.0560603 0.0336729 -0.0604989 0.0964274 -0.1324347 1.0000000 -0.0093439 -0.0740315 -0.0055155 0.0370135 -0.0150887 -0.1836576 -0.1433743
CHILD_FIRST_DENTIST_VISIT 0.1664351 -0.1384778 -0.0448371 0.1973736 -0.1075357 0.0564198 0.0079077 -0.1000001 0.2442452 0.1638872 -0.2234144 -0.1899748 0.0235972 0.1527240 -0.2405553 0.0077628 -0.0131684 0.0918800 0.0389469 -0.0159953 0.0048690 -0.0981092 0.2508031 0.0707220 0.1742756 0.2100800 -0.2427019 -0.0093439 1.0000000 -0.1590037 0.0150972 0.1464882 -0.2556653 -0.1658583 0.0225743
SWEETS_DURING_PREGNANCY -0.1283208 0.0523483 0.0998029 0.0972165 0.0987785 -0.0429882 -0.0810102 0.0330115 -0.1492203 -0.0956930 0.2043311 0.0934099 0.0769212 -0.1937899 0.1660001 0.1173737 -0.0262388 -0.0811561 0.0067961 0.0390003 0.0320613 0.0945880 -0.1072413 -0.0787389 -0.1637369 -0.0711669 0.1340276 -0.0740315 -0.1590037 1.0000000 -0.0678409 -0.0991735 0.2355339 0.1569370 0.0314948
FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY 0.0713443 -0.0504815 0.1423259 0.0032819 -0.1551458 0.1295072 -0.0430535 -0.0894989 -0.0117517 -0.0400002 -0.1419603 -0.0619102 0.1806671 0.2249668 -0.0968562 -0.0280662 0.0884434 -0.0174377 -0.0273497 0.0238149 0.1058151 -0.0045508 0.2466224 0.1697054 0.1843028 0.1041689 -0.1360956 -0.0055155 0.0150972 -0.0678409 1.0000000 0.1079000 -0.3653726 -0.1853197 -0.0163862
ORAL_HEALTH_DURING_PREGNANCY 0.3112358 -0.0796017 -0.0265720 0.1423954 -0.2114226 0.0046485 0.0028545 -0.1280497 0.1770898 0.2039944 -0.3475565 -0.2177413 0.1388370 0.4177592 -0.2603262 -0.1476604 0.1569109 0.0471508 -0.0115616 -0.0159597 -0.0335090 -0.0879938 0.1532459 0.1581116 0.2856318 0.1328545 -0.3794679 0.0370135 0.1464882 -0.0991735 0.1079000 1.0000000 -0.3490975 -0.3038068 -0.1291913
MOTHER_HEALTH_AWARENESS -0.3031192 0.1099469 -0.0441127 -0.1450510 0.2682777 -0.2327245 0.0387885 0.1851232 -0.3080463 -0.2759329 0.5239270 0.4112329 -0.1760461 -0.5149238 0.4560228 0.1247374 -0.1146412 -0.0954903 -0.0664899 0.0187808 -0.0662792 0.1896374 -0.3356887 -0.2732414 -0.3531400 -0.1846409 0.3190912 -0.0150887 -0.2556653 0.2355339 -0.3653726 -0.3490975 1.0000000 0.4398347 0.2053303
FATHER_HEALTH_AWARENESS -0.4062213 0.0966552 0.0332947 -0.0890238 0.1983049 -0.1077373 -0.1295931 0.0681755 -0.1921122 -0.1926890 0.4331051 0.3436875 -0.1562052 -0.4048382 0.3519037 0.1115829 -0.1111781 -0.0116984 0.0201104 -0.0604105 -0.0622400 0.0857335 -0.1277426 -0.1652326 -0.2046807 -0.1369161 0.2782269 -0.1836576 -0.1658583 0.1569370 -0.1853197 -0.3038068 0.4398347 1.0000000 0.2346327
ECC -0.1863595 0.0965081 0.1938318 -0.1746014 0.0291978 -0.0837136 -0.0983869 -0.0002596 -0.1109834 -0.1871299 0.2669090 0.2163238 -0.0882301 -0.0874411 0.1902489 0.2047404 -0.0535015 -0.0197299 -0.0145817 -0.0878466 0.0573372 0.1054878 -0.0313950 -0.0366342 -0.1374730 -0.0802393 0.1179052 -0.1433743 0.0225743 0.0314948 -0.0163862 -0.1291913 0.2053303 0.2346327 1.0000000

Box Plots Dealing with Outliers

To get an idea of the outliers, we draw a boxplot for each numerical attribute.

for (col in 2:ncol(TRAIN)) {
  boxplot(TRAIN[,col],main=paste("Boxplot of the",colnames(TRAIN)[col] ))
}

  • We inspected all the boxplots to form an opinion about the outliers and their effects on the analysis of the dataset.
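As a rough complement to the visual check, the points that a boxplot draws beyond its whiskers can also be listed numerically. The sketch below uses base R's `boxplot.stats()` on a made-up vector (not the ECC data); the value 999 is included on the assumption that a missing-value code such as the one used in this dataset would show up as an extreme point.

```r
# Hypothetical toy vector standing in for one numeric column of TRAIN.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 999)

# boxplot.stats() applies the same 1.5 * IQR whisker rule that boxplot() draws.
stats <- boxplot.stats(x)
outlier_values <- stats$out

# Share of observations flagged as outliers in this column.
outlier_share <- length(outlier_values) / length(x)

print(outlier_values)
print(outlier_share)
```

Running the same two calls inside the plotting loop would give one outlier count per attribute, which may be easier to compare than 35 separate plots.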

3. Classification Methods

library(ade4)
library(data.table)

#COMBINE ALL DATA SO THAT PREPROCESSING IS CONSISTENT ACROSS THE SPLITS
ALL_DATA <- rbind(TRAIN, VALIDATION, TEST)
ALL_DATA_x <- ALL_DATA[,1:35]
ALL_DATA_y <- ALL_DATA[36]

#ONE-HOT ENCODE THE CATEGORICAL FEATURES AND THE FEATURES CONTAINING THE NULL CODE (999)
col_names <- c("CITY", "CHILD_ETHNICITY", "MOTHER_ETHNICITY", "BREASTFEEDING_FREQUENCY", "BREASTFEEDING_DURING_NIGHT", "MOTHER_EMPLOYMENT_STATUS")
for (f in col_names){
  df_all_dummy = acm.disjonctif(ALL_DATA_x[f])
  ALL_DATA_x[f] = NULL
  ALL_DATA_x = cbind(ALL_DATA_x, df_all_dummy)
}

#DELETE .999 FEATURES
col_names999 <- c("MOTHER_ETHNICITY.999", "BREASTFEEDING_FREQUENCY.999", "BREASTFEEDING_DURING_NIGHT.999")
for (f in col_names999){
  ALL_DATA_x[f] = NULL
}
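To illustrate what the encoding step produces, here is a minimal base-R sketch on a made-up column. It uses `model.matrix()` rather than `acm.disjonctif()`, so it is an analogous construction and not the code used in this project; the `toy` data frame and its values are assumptions.

```r
# Hypothetical toy column; 999 plays the role of the missing-value code.
toy <- data.frame(MOTHER_ETHNICITY = c(1, 2, 999, 1, 3))

# Treat the codes as categories, then build one indicator column per level.
toy$MOTHER_ETHNICITY <- factor(toy$MOTHER_ETHNICITY)
dummies <- model.matrix(~ MOTHER_ETHNICITY - 1, data = toy)
colnames(dummies) <- paste0("MOTHER_ETHNICITY.", levels(toy$MOTHER_ETHNICITY))

# Drop the indicator of the missing-value code, so a missing entry
# is represented by zeros in all remaining indicator columns.
dummies <- dummies[, colnames(dummies) != "MOTHER_ETHNICITY.999", drop = FALSE]

print(dummies)
```

In this sketch the third row, which held the code 999, ends up with zeros in every indicator column, while every other row has exactly one 1.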



#NORMALIZATION FUNCTION
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

#APPLY NORMALIZATION
ALL_DATA_x <- as.data.frame(lapply(ALL_DATA_x, normalize))
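One thing to keep in mind with min-max scaling is that a column with a single distinct value makes `max(x) - min(x)` equal to zero, so the formula returns `NaN`. The variant below is a hedged sketch of one way to guard against that; the name `normalize_safe` and the choice to map constant columns to 0 are assumptions, not part of the original analysis.

```r
# Min-max scaling with a guard for constant columns
# (assumption: a constant column is mapped to all zeros).
normalize_safe <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) {
    return(rep(0, length(x)))
  }
  (x - min(x)) / rng
}

# Hypothetical toy data: one varying column and one constant column.
toy <- data.frame(a = c(2, 4, 6), b = c(5, 5, 5))
scaled <- as.data.frame(lapply(toy, normalize_safe))

print(scaled)
```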

For ordered data

ALL_DATA_x_o = ALL_DATA[,1:35]

factor_vars = c("CITY", "CHILD_ETHNICITY", "CHILD_GENDER", "MOTHER_SERBIAN_LANGUAGE", 
                "CHILD_SERBIAN_LANGUAGE", "MARITAL_STATUS", "MOTHER_ETHNICITY")
ordered_vars = c("CHILD_AGE", "MOTHER_AGE", "BIRTH_ORDER", 
                "MOTHER_EDUCATION_LEVEL", "MOTHER_EMPLOYMENT_STATUS", "QUALITY_OF_HOUSING",
                "HOUSING_CONDITIONS", "HOUSEHOLD_MONTHLY_INCOME", 
                "BIRTH_WEIGHT", "BREASTFEEDING", "BREASTFEEDING_DURING_NIGHT",
                "BOTTLE_FEEDING", "INFANT_FORMULAS", "ADDITIONAL_FOOD_SWEETENING",
                "CHILD_FLUORIDE_SUPPLEMENTS", "CHILD_FLUORIDE_TOOTHPASTE", "CHILD_ORAL_HYGIENE",
                "CHILD_TOOTH_BRUSHING", "DIARRHEA_DURING_INFANCY", "MEDICAL_SYRUPS",
                "CHILD_FIRST_DENTIST_VISIT", "SWEETS_DURING_PREGNANCY",
                "FLUORIDE_SUPPLEMENTS_DURING_PREGNANCY", "ORAL_HEALTH_DURING_PREGNANCY",
                "MOTHER_HEALTH_AWARENESS", "FATHER_HEALTH_AWARENESS")

#ORDERED
for (var in ordered_vars) ALL_DATA_x_o[,var] = ordered(ALL_DATA_x_o[,var])
for (var in factor_vars) ALL_DATA_x_o[,var] = factor(ALL_DATA_x_o[,var])

#ONE-HOT ENCODE THE CATEGORICAL FEATURES AND THE FEATURES CONTAINING THE NULL CODE (999)
col_names <- c("CITY", "CHILD_ETHNICITY", "MOTHER_ETHNICITY", "BREASTFEEDING_FREQUENCY", "BREASTFEEDING_DURING_NIGHT", "MOTHER_EMPLOYMENT_STATUS")
for (f in col_names){
  df_all_dummy = acm.disjonctif(ALL_DATA_x_o[f])
  ALL_DATA_x_o[f] = NULL
  ALL_DATA_x_o = cbind(ALL_DATA_x_o, df_all_dummy)
}

#DELETE .999 FEATURES
col_names999 <- c("MOTHER_ETHNICITY.999", "BREASTFEEDING_FREQUENCY.999", "BREASTFEEDING_DURING_NIGHT.999")
for (f in col_names999){
  ALL_DATA_x_o[f] = NULL
}

3.1. Association Rule Mining (implemented in the paper)

col_names <- colnames(TRAIN)
TRAIN_factor <- as.data.frame(lapply(TRAIN[,col_names], factor))

rules1 <- apriori(TRAIN_factor, appearance = list(rhs=c("ECC=1"), default="lhs"), parameter = list(minlen=2, maxlen=7, sup = 0.1, conf = 0.4, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.4    0.1    1 none FALSE            TRUE       5     0.1      2
##  maxlen target   ext
##       7  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 23 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[125 item(s), 239 transaction(s)] done [0.00s].
## sorting and recoding items ... [93 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7
## Warning in apriori(TRAIN_factor, appearance = list(rhs = c("ECC=1"),
## default = "lhs"), : Mining stopped (maxlen reached). Only patterns up to a
## length of 7 returned!
##  done [1.26s].
## writing ... [125 rule(s)] done [0.07s].
## creating S4 object  ... done [0.11s].
rules1<-sort(rules1, decreasing=TRUE, by="confidence")
#inspect(rules1)

rules2 <- apriori(TRAIN_factor, appearance = list(rhs=c("ECC=2"), default="lhs"), parameter = list(minlen=2, maxlen=7, sup = 0.3, conf = 0.8, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.3      2
##  maxlen target   ext
##       7  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 71 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[125 item(s), 239 transaction(s)] done [0.00s].
## sorting and recoding items ... [47 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7
## Warning in apriori(TRAIN_factor, appearance = list(rhs = c("ECC=2"),
## default = "lhs"), : Mining stopped (maxlen reached). Only patterns up to a
## length of 7 returned!
##  done [0.03s].
## writing ... [246 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules2<-sort(rules2, decreasing=TRUE, by="confidence")
#inspect(rules2)
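For readers unfamiliar with the `sup` and `conf` thresholds passed to `apriori`, the following base-R sketch computes support, confidence and lift for a single rule by hand. The toy data frame and the rule `{BOTTLE_FEEDING=1} => {ECC=1}` are made up for illustration and are not results from the ECC data.

```r
# Hypothetical toy transactions: one attribute and the ECC label.
toy <- data.frame(
  BOTTLE_FEEDING = c(1, 1, 1, 2, 2, 1, 2, 1),
  ECC            = c(1, 1, 2, 2, 2, 1, 2, 2)
)

lhs <- toy$BOTTLE_FEEDING == 1   # rule body: {BOTTLE_FEEDING=1}
rhs <- toy$ECC == 1              # rule head: {ECC=1}

# Support: share of all rows that match both body and head.
support <- mean(lhs & rhs)
# Confidence: share of the rows matching the body that also match the head.
confidence <- sum(lhs & rhs) / sum(lhs)
# Lift: confidence relative to the overall frequency of the head.
lift <- confidence / mean(rhs)

print(c(support = support, confidence = confidence, lift = lift))
```

With these definitions, a rule is kept only if its support and confidence both reach the chosen minimums, which is why the two `apriori` calls with different thresholds return different numbers of rules.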

Justification for Model Parameters

3.2. SVM

#SEPARATE TRAIN, VALIDATION AND TEST
TRAIN_conv_x <- ALL_DATA_x[1:239,]
VALIDATION_conv_x <- ALL_DATA_x[240:273,]
TEST_conv_x <- ALL_DATA_x[274:341,]

TRAIN_y <- TRAIN[,36]
TRAIN_y <- as.factor(TRAIN_y)
VALIDATION_y <- VALIDATION[,36]
VALIDATION_y <- as.factor(VALIDATION_y)
TEST_y <- TEST[,36]
TEST_y <- as.factor(TEST_y)

#POSSIBLE COST AND GAMMA VALUES
cost_try = c(0.1, 0.5, 1, 5, 10, 20, 50, 80, 100, 500)
gamma_try = c(0.005, 0.01, 0.02, 0.05, 0.1, 0.5, 1, 2, 5, 10)

#BEST COST AND GAMMA VALUES SELECTED ACCORDING TO ACCURACY
max_accur = 0
best_cost = 1
best_gamma = 1
for (i in 1:10)
{
  for (j in 1:10)
  {
    svm_model <- svm(x = TRAIN_conv_x, y = TRAIN_y, gamma = gamma_try[j], cost = cost_try[i])
    svm_res <- predict(svm_model, VALIDATION_conv_x)
    conf_res <- confusionMatrix(svm_res, VALIDATION_y)
    
    if (max_accur < conf_res$overall[1])
    {
      max_accur = conf_res$overall[1]
      best_cost = cost_try[i]
      best_gamma = gamma_try[j]
      print(conf_res$overall[1])
    }
  }
}
##  Accuracy 
## 0.6764706 
##  Accuracy 
## 0.7058824 
##  Accuracy 
## 0.7647059
#BEST VALUES PRINTED
print(best_cost)
## [1] 5
print(best_gamma)
## [1] 0.01
#TEST DATASET IS PREDICTED AND RESULTS ARE DISPLAYED
svm_model <- svm(x = TRAIN_conv_x, y = TRAIN_y, gamma = best_gamma, cost = best_cost)
svm_res <- predict(svm_model, TEST_conv_x)
conf_res <- confusionMatrix(svm_res, TEST_y)
print(conf_res)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1  5  2
##          2 17 44
##                                           
##                Accuracy : 0.7206          
##                  95% CI : (0.5985, 0.8227)
##     No Information Rate : 0.6765          
##     P-Value [Acc > NIR] : 0.261543        
##                                           
##                   Kappa : 0.2236          
##  Mcnemar's Test P-Value : 0.001319        
##                                           
##             Sensitivity : 0.22727         
##             Specificity : 0.95652         
##          Pos Pred Value : 0.71429         
##          Neg Pred Value : 0.72131         
##              Prevalence : 0.32353         
##          Detection Rate : 0.07353         
##    Detection Prevalence : 0.10294         
##       Balanced Accuracy : 0.59190         
##                                           
##        'Positive' Class : 1               
## 

Justification for Model Parameters

  • For the SVM model, the cost parameter (C) controls how strongly we penalize training points that fall inside the margin or on the wrong side of it. Decreasing the cost widens the margin and tolerates more training errors. Gamma defines how far the influence of a single training example reaches.

  • If gamma is high, each training example influences only its immediate neighbourhood, so the decision boundary depends mostly on the points close to it and becomes more wiggly.

  • If gamma is low, the influence of each training example reaches far, so distant points also shape the decision boundary, which becomes smoother and closer to a straight line.

  • Gamma and C should not be very high, because that leads to overfitting, nor very small, because that leads to underfitting. We therefore need values of C and gamma that give a good fit. In our case, a grid of cost and gamma values was tried and the pair with the best validation accuracy was kept (cost = 5, gamma = 0.01).
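The effect of gamma described above can be seen directly in the radial basis kernel, which is the default kernel of `svm` in e1071 and has the form exp(-gamma * ||u - v||^2). The sketch below evaluates it for a nearby and a distant point; the points and the helper name `rbf_kernel` are illustrative assumptions.

```r
# Radial basis kernel: exp(-gamma * squared Euclidean distance).
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Hypothetical points: a reference point, a near one and a far one.
u    <- c(0, 0)
near <- c(0.5, 0)
far  <- c(2, 0)

for (g in c(0.01, 1, 10)) {
  cat(sprintf("gamma = %5.2f   near: %.4f   far: %.4f\n",
              g, rbf_kernel(u, near, g), rbf_kernel(u, far, g)))
}
```

For a small gamma both points keep a kernel value close to 1, so distant examples still contribute; for a large gamma the value for the distant point drops to practically zero and only close neighbours matter.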

3.3. KNN

#SEPARATE TRAIN, VALIDATION AND TEST
TRAIN_conv_x <- ALL_DATA_x[1:239,]
VALIDATION_conv_x <- ALL_DATA_x[240:273,]
TEST_conv_x <- ALL_DATA_x[274:341,]

TRAIN_y <- TRAIN[,36]
TRAIN_y <- as.factor(TRAIN_y)
VALIDATION_y <- VALIDATION[,36]
VALIDATION_y <- as.factor(VALIDATION_y)
TEST_y <- TEST[,36]
TEST_y <- as.factor(TEST_y)

#BEST K VALUE IS SELECTED ACCORDING TO ACCURACY
max_accur = 0
best_k_val = 1
for (i in 1:100)
{
  test_pred <- knn(train = TRAIN_conv_x, test = VALIDATION_conv_x, cl = TRAIN_y, k=i)
  conf_res <- confusionMatrix(test_pred, VALIDATION_y)
  
  if (max_accur < conf_res$overall[1])
  {
    max_accur = conf_res$overall[1]
    best_k_val = i
    print(conf_res$overall[1])
  }
}
##  Accuracy 
## 0.7058824
#BEST VALUES PRINTED
print(best_k_val)
## [1] 1
#TEST DATASET IS PREDICTED AND RESULTS ARE DISPLAYED
test_pred <- knn(train = TRAIN_conv_x, test = TEST_conv_x, cl = TRAIN_y, k=best_k_val)
conf_res <- confusionMatrix(test_pred, TEST_y)
print(conf_res)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1  8  8
##          2 14 38
##                                           
##                Accuracy : 0.6765          
##                  95% CI : (0.5521, 0.7849)
##     No Information Rate : 0.6765          
##     P-Value [Acc > NIR] : 0.5575          
##                                           
##                   Kappa : 0.2043          
##  Mcnemar's Test P-Value : 0.2864          
##                                           
##             Sensitivity : 0.3636          
##             Specificity : 0.8261          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.7308          
##              Prevalence : 0.3235          
##          Detection Rate : 0.1176          
##    Detection Prevalence : 0.2353          
##       Balanced Accuracy : 0.5949          
##                                           
##        'Positive' Class : 1               
## 

We also try KNN on the ordered dataset.

#SEPARATE TRAIN, VALIDATION AND TEST
TRAIN_conv_x <- ALL_DATA_x_o[1:239,]
VALIDATION_conv_x <- ALL_DATA_x_o[240:273,]
TEST_conv_x <- ALL_DATA_x_o[274:341,]

TRAIN_y <- TRAIN[,36]
TRAIN_y <- as.factor(TRAIN_y)
VALIDATION_y <- VALIDATION[,36]
VALIDATION_y <- as.factor(VALIDATION_y)
TEST_y <- TEST[,36]
TEST_y <- as.factor(TEST_y)

#BEST K VALUE IS SELECTED ACCORDING TO ACCURACY
max_accur = 0
best_k_val = 1
for (i in 1:100)
{
  test_pred <- knn(train = TRAIN_conv_x, test = VALIDATION_conv_x, cl = TRAIN_y, k=i)
  conf_res <- confusionMatrix(test_pred, VALIDATION_y)
  
  if (max_accur < conf_res$overall[1])
  {
    max_accur = conf_res$overall[1]
    best_k_val = i
    print(conf_res$overall[1])
  }
}
##  Accuracy 
## 0.5882353 
##  Accuracy 
## 0.6176471 
##  Accuracy 
## 0.6470588 
##  Accuracy 
## 0.6764706
#BEST VALUES PRINTED
print(best_k_val)
## [1] 39
#TEST DATASET IS PREDICTED AND RESULTS ARE DISPLAYED
test_pred <- knn(train = TRAIN_conv_x, test = TEST_conv_x, cl = TRAIN_y, k=best_k_val)
conf_res <- confusionMatrix(test_pred, TEST_y)
print(conf_res)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1  0  0
##          2 22 46
##                                           
##                Accuracy : 0.6765          
##                  95% CI : (0.5521, 0.7849)
##     No Information Rate : 0.6765          
##     P-Value [Acc > NIR] : 0.5575          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : 7.562e-06       
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.6765          
##              Prevalence : 0.3235          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 1               
## 

Justification for Model Parameters

  • For the KNN model, the only important parameter is the value of k. The model looks through the training data and finds the k training examples that are closest to the new example. It then assigns the most common class label among those k training examples to the new example.

  • When the data is fed to the model directly, k=1 gives the best validation accuracy among all k values tried. Normally, k=1 is a sign of possible overfitting, but we did not observe that here. As our class labels are nominal and take only two values, 1-NN does not directly show overfitting.

  • We also tried this model on the ordered dataset, in which the attributes are typed as factors or ordered factors. The optimal k value is 39 rather than 1 in this case, but the test accuracy did not change: it is 0.6765 in both cases, and with k=39 the model assigns every test example to class 2.
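The voting procedure described above can be sketched in a few lines of base R. This is an illustrative re-implementation on made-up points, not the `knn` function used in the analysis; in particular, ties are resolved here by taking the first label, which is a simplifying assumption.

```r
# Minimal k-nearest-neighbour classifier with Euclidean distance.
knn_predict <- function(train_x, train_y, new_x, k) {
  # Distance from the new example to every training row.
  new_mat <- matrix(new_x, nrow(train_x), ncol(train_x), byrow = TRUE)
  d <- sqrt(rowSums((train_x - new_mat)^2))
  # Labels of the k closest rows, then a majority vote.
  votes <- table(train_y[order(d)[1:k]])
  names(votes)[which.max(votes)]
}

# Hypothetical training set: three points of class 2, two of class 1.
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6), ncol = 2, byrow = TRUE)
train_y <- factor(c(2, 2, 2, 1, 1))

pred_k1 <- knn_predict(train_x, train_y, c(4, 4), k = 1)
pred_k5 <- knn_predict(train_x, train_y, c(4, 4), k = 5)
print(c(k1 = pred_k1, k5 = pred_k5))
```

With k = 1 the new point takes the label of its single nearest neighbour; with k equal to the size of the training set the vote is simply the majority class, which is the kind of behaviour a very large k drifts towards.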

3.4. Naive Bayesian

#SEPARATE TEST
TEST_conv_x <- ALL_DATA_x[274:341,]
TEST_y <- TEST[,36]
TEST_y <- as.factor(TEST_y)

#VALIDATION COMBINED WITH TRAIN
TV_conv_x <- ALL_DATA_x[1:273,]
TV_y <- c(TRAIN_y, VALIDATION_y)
TV_y <- as.factor(TV_y)

#BECAUSE OF NO PARAMETER SELECTION, NB APPLIED DIRECTLY
nb_model <- naiveBayes(x = TV_conv_x, y = TV_y, laplace = 0)
nb_res <- predict(nb_model, TEST_conv_x)
conf_res <- confusionMatrix(nb_res, TEST_y)
print(conf_res)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1 15 28
##          2  7 18
##                                           
##                Accuracy : 0.4853          
##                  95% CI : (0.3622, 0.6097)
##     No Information Rate : 0.6765          
##     P-Value [Acc > NIR] : 0.9996453       
##                                           
##                   Kappa : 0.0585          
##  Mcnemar's Test P-Value : 0.0007232       
##                                           
##             Sensitivity : 0.6818          
##             Specificity : 0.3913          
##          Pos Pred Value : 0.3488          
##          Neg Pred Value : 0.7200          
##              Prevalence : 0.3235          
##          Detection Rate : 0.2206          
##    Detection Prevalence : 0.6324          
##       Balanced Accuracy : 0.5366          
##                                           
##        'Positive' Class : 1               
## 

Justification for Model Parameters

  • Naive Bayes is the simplest classification algorithm we used, and it has no parameters that need to be tuned here. On the test set, this model gives low accuracy (0.4853, below the no-information rate of 0.6765) but high sensitivity (0.6818).
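The accuracy and sensitivity figures can be recomputed from the four counts of the confusion matrix. The sketch below copies the counts from the Naive Bayes output and treats class 1 as the positive class, as `confusionMatrix` reports; the variable names are chosen here for illustration.

```r
# Counts from the Naive Bayes confusion matrix
# (rows = prediction, columns = reference).
cm <- matrix(c(15, 7, 28, 18), nrow = 2,
             dimnames = list(Prediction = c("1", "2"),
                             Reference  = c("1", "2")))

accuracy    <- sum(diag(cm)) / sum(cm)        # correct predictions / all cases
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # true positives / actual positives
specificity <- cm["2", "2"] / sum(cm[, "2"])  # true negatives / actual negatives
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

The same four lines can be applied to any of the other confusion matrices in this section to compare the models on sensitivity rather than on accuracy alone.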

3.5. Random Forest

#SEPARATE TRAIN, VALIDATION AND TEST
TRAIN_conv_x <- ALL_DATA_x[1:239,]
VALIDATION_conv_x <- ALL_DATA_x[240:273,]
TEST_conv_x <- ALL_DATA_x[274:341,]

TRAIN_y <- TRAIN[,36]
TRAIN_y <- as.factor(TRAIN_y)
VALIDATION_y <- VALIDATION[,36]
VALIDATION_y <- as.factor(VALIDATION_y)
TEST_y <- TEST[,36]
TEST_y <- as.factor(TEST_y)

#BEST NTREE VALUE IS SELECTED ACCORDING TO ACCURACY
max_accur = 0
res_num_of_tree = 0
num_of_tree = 16
for (i in 1:7)
{
  set.seed(97)
  
  rf_model <- randomForest(x = TRAIN_conv_x, y = TRAIN_y, ntree = num_of_tree)
  rf_res <- predict(rf_model, VALIDATION_conv_x)
  rf_res_round <- as.factor(round(as.numeric(rf_res)))
  conf_res <- confusionMatrix(rf_res_round, VALIDATION_y)
  
  if (conf_res$overall[1] > max_accur)
  {
    max_accur = conf_res$overall[1]
    res_num_of_tree = num_of_tree
    print(conf_res$overall[1])
  }
  
  num_of_tree = num_of_tree*2
}
##  Accuracy 
## 0.5882353 
##  Accuracy 
## 0.6470588 
##  Accuracy 
## 0.6764706
#BEST VALUES PRINTED
print(res_num_of_tree)
## [1] 64
#TEST DATASET IS PREDICTED AND RESULTS ARE DISPLAYED
set.seed(97)
rf_res_model <- randomForest(x = TRAIN_conv_x, y = TRAIN_y, ntree = res_num_of_tree)
#PREDICT WITH THE MODEL REFIT AT THE SELECTED NTREE
rf_res <- predict(rf_res_model, TEST_conv_x)
rf_res_round <- as.factor(round(as.numeric(rf_res)))
conf_res <- confusionMatrix(rf_res_round, TEST_y)
print(conf_res)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1  4  2
##          2 18 44
##                                           
##                Accuracy : 0.7059          
##                  95% CI : (0.5829, 0.8102)
##     No Information Rate : 0.6765          
##     P-Value [Acc > NIR] : 0.3537252       
##                                           
##                   Kappa : 0.1707          
##  Mcnemar's Test P-Value : 0.0007962       
##                                           
##             Sensitivity : 0.18182         
##             Specificity : 0.95652         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.70968         
##              Prevalence : 0.32353         
##          Detection Rate : 0.05882         
##    Detection Prevalence : 0.08824         
##       Balanced Accuracy : 0.56917         
##                                           
##        'Positive' Class : 1               
## 

The same model is applied to the ordered dataset.

#SEPARATE TRAIN, VALIDATION AND TEST
TRAIN_conv_x <- ALL_DATA_x_o[1:239,]
VALIDATION_conv_x <- ALL_DATA_x_o[240:273,]
TEST_conv_x <- ALL_DATA_x_o[274:341,]

TRAIN_y <- TRAIN[,36]
TRAIN_y <- as.factor(TRAIN_y)
VALIDATION_y <- VALIDATION[,36]
VALIDATION_y <- as.factor(VALIDATION_y)
TEST_y <- TEST[,36]
TEST_y <- as.factor(TEST_y)

#BEST NTREE VALUE IS SELECTED ACCORDING TO ACCURACY
max_accur = 0
res_num_of_tree = 0
#CANDIDATE NTREE VALUES: 2, 12, 22, ..., 992
ntrees = seq(2, 1000, by = 10)
for (i in ntrees)
{
  set.seed(97)
  
  rf_model <- randomForest(x = TRAIN_conv_x, y = TRAIN_y, ntree = i)
  rf_res <- predict(rf_model, VALIDATION_conv_x)
  rf_res_round <- as.factor(round(as.numeric(rf_res)))
  conf_res <- confusionMatrix(rf_res_round, VALIDATION_y)
  
  if (conf_res$overall[1] > max_accur)
  {
    max_accur = conf_res$overall[1]
    res_num_of_tree = i
    print(conf_res$overall[1])
  }
}
##  Accuracy 
## 0.5588235 
##  Accuracy 
## 0.6764706 
##  Accuracy 
## 0.7647059
#BEST VALUES PRINTED
print(res_num_of_tree)
## [1] 2048
#TEST DATASET IS PREDICTED AND RESULTS ARE DISPLAYED
set.seed(97)
rf_res_model <- randomForest(x = TRAIN_conv_x, y = TRAIN_y, ntree = res_num_of_tree)
#PREDICT WITH THE MODEL REFIT AT THE SELECTED NTREE
rf_res <- predict(rf_res_model, TEST_conv_x)
rf_res_round <- as.factor(round(as.numeric(rf_res)))
conf_res <- confusionMatrix(rf_res_round, TEST_y)
print(conf_res)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2
##          1  7  2
##          2 15 44
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6302, 0.8471)
##     No Information Rate : 0.6765          
##     P-Value [Acc > NIR] : 0.120363        
##                                           
##                   Kappa : 0.3248          
##  Mcnemar's Test P-Value : 0.003609        
##                                           
##             Sensitivity : 0.3182          
##             Specificity : 0.9565          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.7458          
##              Prevalence : 0.3235          
##          Detection Rate : 0.1029          
##    Detection Prevalence : 0.1324          
##       Balanced Accuracy : 0.6374          
##                                           
##        'Positive' Class : 1               
## 

Justification for Model Parameters

  • The most important parameter for this model is the number of trees. We tried several values of this parameter and kept the one with the best validation accuracy.

  • Notice that when the attributes are passed to the model directly, without declaring them as factors or ordered factors, the selected number of trees is 64. When we preprocess the data to specify the attribute types, the reported value is 2048. There is a trade-off: a larger number of trees tends to give better accuracy but takes more memory and computation time.
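The selection procedure described above, trying several candidate values and keeping the best one, can be written once as a small helper. This is a hedged sketch: `select_best` and `toy_score` are hypothetical names, and `toy_score` merely stands in for fitting a forest with a given number of trees and returning its validation accuracy.

```r
# Generic validation search: keep the candidate that scores best.
select_best <- function(candidates, score_fn) {
  best_value <- NA
  best_score <- -Inf
  for (value in candidates) {
    score <- score_fn(value)
    if (score > best_score) {
      best_score <- score
      best_value <- value   # store the candidate that was just evaluated
    }
  }
  list(value = best_value, score = best_score)
}

# Hypothetical stand-in for "fit with ntree trees, return validation accuracy".
toy_score <- function(ntree) 1 - abs(ntree - 64) / 1000

result <- select_best(c(16, 32, 64, 128, 256), toy_score)
print(result$value)
print(result$score)
```

Keeping the search in one function makes it harder for the recorded best value and the value actually evaluated to drift apart, and the same helper could be reused for k in KNN or for the cost and gamma grid of the SVM.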

3.6. ANN

library(neuralnet)

TRAIN_conv_x <- ALL_DATA_x[1:239,]
VALIDATION_conv_x <- ALL_DATA_x[240:273,]
TEST_conv_x <- ALL_DATA_x[274:341,]
TRAIN_y <- TRAIN[,36]
TRAIN_y <- TRAIN_y - 1
VALIDATION_y <- VALIDATION[,36]
VALIDATION_y <- as.factor(VALIDATION_y-1)
TEST_y <- TEST[,36]
TEST_y <- as.factor(TEST_y - 1)

train_data <- data.frame(TRAIN_conv_x, TRAIN_y) 
col_names = colnames(train_data)
for (i in col_names)
{
  train_data[,i] <- as.numeric(train_data[,i])
}
col_names = colnames(TRAIN_conv_x)
formula_asd <- as.formula(paste("TRAIN_y ~ ", paste(col_names, collapse = "+")))

VALIDATION_conv_x_nn <- VALIDATION_conv_x
col_names = colnames(VALIDATION_conv_x_nn)
for (i in col_names)
{
  VALIDATION_conv_x_nn[,i] <- as.numeric(VALIDATION_conv_x_nn[,i])
}

TEST_conv_x_nn <- TEST_conv_x
col_names = colnames(TEST_conv_x_nn)
for (i in col_names)
{
  TEST_conv_x_nn[,i] <- as.numeric(TEST_conv_x_nn[,i])
}

#THRESHOLD THE NETWORK OUTPUT AT 0.5 TO GET A 0/1 CLASS LABEL
nn_result_f <- function(x) {
  if (x >= 0.5) 1 else 0
}

max_accur = 0
best_l1_num = 1
best_th = 1
for (i in 1:20)
{
  for (j in 1:5)
  {
    nn_model <- neuralnet(formula_asd, data=train_data, linear.output = TRUE, hidden=c(i,1), threshold=0.01*j)
    nn_res <- compute(nn_model, VALIDATION_conv_x_nn)$net.result
    nn_res <- as.numeric(lapply(nn_res, nn_result_f))

      nn_res <- as.factor(nn_res)
      conf_res <- confusionMatrix(nn_res, VALIDATION_y)
      if (max_accur < conf_res$overall[1])
      {
        max_accur = conf_res$overall[1]
        
        best_l1_num = i
        best_th = 0.01*j
        
        print(conf_res$overall[1])
      }

  }
  print(i)
}
##     Accuracy 
## 0.4705882353 
##     Accuracy 
## 0.5294117647 
##     Accuracy 
## 0.6764705882 
## [1] 1
## Warning in confusionMatrix.default(nn_res, VALIDATION_y): Levels are not in
## the same order for reference and data. Refactoring data to match.
## [1] 2
## [1] 3
##     Accuracy 
## 0.7352941176 
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
##     Accuracy 
## 0.7647058824 
## [1] 9
## Warning in confusionMatrix.default(nn_res, VALIDATION_y): Levels are not in
## the same order for reference and data. Refactoring data to match.
## [1] 10
## [1] 11
##     Accuracy 
## 0.7941176471 
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
##     Accuracy 
## 0.8235294118 
## [1] 18
## [1] 19
## [1] 20
print(best_l1_num)
## [1] 18
print(best_th)
## [1] 0.01
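# Refit with the selected parameters and evaluate on the test set.
# Note: the grid search used hidden=c(i,1), whereas this final model uses a single hidden layer of best_l1_num units.
nn_model <- neuralnet(formula_asd, data=train_data, linear.output = TRUE, hidden=best_l1_num, threshold=best_th)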
nn_res <- compute(nn_model, TEST_conv_x_nn)$net.result
nn_res <- as.numeric(lapply(nn_res, nn_result_f))
nn_res <- as.factor(nn_res)

conf_res <- confusionMatrix(nn_res, TEST_y)
print(conf_res)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 12  7
##          1 10 39
##                                                 
##                Accuracy : 0.75                  
##                  95% CI : (0.6301776, 0.8471195)
##     No Information Rate : 0.6764706             
##     P-Value [Acc > NIR] : 0.1203633             
##                                                 
##                   Kappa : 0.4077869             
##  Mcnemar's Test P-Value : 0.6276258             
##                                                 
##             Sensitivity : 0.5454545             
##             Specificity : 0.8478261             
##          Pos Pred Value : 0.6315789             
##          Neg Pred Value : 0.7959184             
##              Prevalence : 0.3235294             
##          Detection Rate : 0.1764706             
##    Detection Prevalence : 0.2794118             
##       Balanced Accuracy : 0.6966403             
##                                                 
##        'Positive' Class : 0                     
## 
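
The statistics in this report can be reproduced by hand from the four cell counts. The sketch below copies the counts from the confusion matrix above and recomputes the main figures; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the test-set confusion matrix above
# (rows = prediction, columns = reference; the 'positive' class is 0).
cm <- matrix(c(12, 10, 7, 39), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # correct predictions / all cases
sensitivity <- cm["0", "0"] / sum(cm[, "0"])    # true positives / reference positives
specificity <- cm["1", "1"] / sum(cm[, "1"])    # true negatives / reference negatives
ppv         <- cm["0", "0"] / sum(cm["0", ])    # positive predictive value
npv         <- cm["1", "1"] / sum(cm["1", ])    # negative predictive value
balanced    <- (sensitivity + specificity) / 2  # balanced accuracy
nir         <- max(colSums(cm)) / sum(cm)       # no information rate (majority class share)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced = balanced, nir = nir), 7)
```

Because class 0 is treated as the positive class, sensitivity here is the recall for class 0 and specificity is the recall for class 1.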

Justification for Model Parameters

4. Clustering Methods

4.1. K-means

# K-means on training Data
X = ALL_DATA_x

# Applying k-means to the dataset with k = 10
set.seed(13)
kmeans = kmeans(X, 10, iter.max = 500)

# Visualizing library
# install.packages("cluster")
library(cluster)
clusplot(X,
         kmeans$cluster,
         lines = 0, # no line wanted
         shade = TRUE, # shade depending on the density
         color = TRUE,
         labels = 0,
         plotchar = FALSE,
         span = TRUE,
         main = paste("Clusters of Data"),
         xlab = "x-axis",
         ylab = "y-axis")

Justification for K-means Parameters

The initial configuration is fixed with set.seed. We run k-means for k = 1 to 50 and plot the within-cluster sum of squares against k to find the optimal number of clusters with the elbow method.

set.seed(123) 
wcss = vector() # an empty vector
for (i in 1:50) wcss[i] = sum(kmeans(X, i)$withinss)
plot(1:50, wcss, type = "b", main = paste("Clusters"), xlab = "# Clusters", ylab = "Within Cluster SS")
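
As a complement to the plot, the elbow can also be read from the relative drop in the within-cluster sum of squares. The following is a minimal sketch on hypothetical toy data (three well-separated groups), not on the ECC data; `toy`, `wcss_toy` and `rel_drop` are illustrative names, and `nstart = 20` is an assumed setting that keeps the best of several random starts for each k.

```r
set.seed(42)
# Hypothetical toy data: three well-separated 2-D groups of 30 points each
toy <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2),
             matrix(rnorm(60, mean = 20), ncol = 2))

k_values <- 1:8
# Total within-cluster sum of squares for each k (best of 20 random starts)
wcss_toy <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 20, iter.max = 100)$tot.withinss
})

# Relative reduction in WCSS gained by adding one more cluster
rel_drop <- -diff(wcss_toy) / head(wcss_toy, -1)

print(data.frame(k = k_values,
                 wcss = round(wcss_toy, 1),
                 rel_drop_to_next = c(round(rel_drop, 3), NA)))
```

The elbow is the k after which the relative drop becomes small; plot(k_values, wcss_toy, type = "b") draws the same curve as in the analysis above.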

4.2. Hierarchical Clustering

In this section, we also apply hierarchical clustering. To understand which linkage separates the data best, we plot the dendrogram for each linkage in a for loop. As seen from the dendrograms, the best separation is obtained when ward.D is used.

# H-clust with different linkages
X = ALL_DATA_x_o
dend = list(list(),list(),list())
meth = c("ward.D", "single", "average")
names(dend) = meth
# Using dendrogram to find the opt num of clusters
for (i in 1:3) {
  dend[i] = list(hclust(dist(X, method = "euclidean"), method = meth[i])) #dist.method: euc #agglom.method: ward
  plot(dend[[i]],
       main = paste("Dendrogram using", meth[i], sep = " " ), # title
       xlab = "Points",
       ylab = paste("Euclidean", "Distance", sep = " ")
  )
}

# Fitting hierarchical clustering with k = 2 (found using the dendrogram)
numClus = 2
hc = hclust(dist(X, method = "euclidean"), method = "ward.D") # same function with different var.name
y_hc = cutree(hc, k = numClus) # cut the tree into numClus groups

# Visualizing the clusters
# install.packages("cluster")
library(cluster)
clusplot(X,
         y_hc,
         lines = 0, # no lines between the cluster centers
         shade = TRUE,
         color = TRUE,
         labels = 1, # 1: pick the points to label, 2: label all points
         plotchar = FALSE,
         span = TRUE, # draw each cluster as the smallest ellipse spanning its points
         main = paste("Clusters of Well Separated Data using ward.D"),
         xlab = "X1",
         ylab = "X2")

clus_size = vector(length = numClus)
for (i in 1:length(y_hc)) clus_size[y_hc[i]] = clus_size[y_hc[i]]+1 
show(clus_size)
## [1] 322  19

Justification for H-clustering Parameters

For the H-clustering parameters, we first plot the dendrogram of the clusters. On this dendrogram, we see the separation distance (length) of the linkages. Then, we find the number of clusters by cutting the tree at the point of maximum length, which gives k = 2 for the ward.D linkage used above.
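
The "cut at the maximum length" rule can also be applied numerically instead of by eye. The sketch below is one possible way to do it, shown on hypothetical toy data with two well-separated groups rather than on the ECC data; `toy`, `hc_toy` and `k_gap` are illustrative names. It assumes a monotone linkage such as ward.D, for which the merge heights are non-decreasing.

```r
set.seed(7)
# Hypothetical toy data: two well-separated 2-D groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0),   ncol = 2),
             matrix(rnorm(40, mean = 100), ncol = 2))

hc_toy <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# hc_toy$height holds the n - 1 merge heights in merge order.
# After the i-th merge there are n - i clusters, so cutting inside the
# largest gap between consecutive heights gives n - which.max(gaps) clusters.
n     <- nrow(toy)
gaps  <- diff(hc_toy$height)
k_gap <- n - which.max(gaps)

groups <- cutree(hc_toy, k = k_gap)
print(k_gap)
print(table(groups))
```

On real data the largest gap is only a heuristic, so it is worth comparing the suggested k with the dendrogram plot and with the resulting cluster sizes.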

4.3. DB-SCAN Clustering

# Compute DBSCAN using fpc package
library("fpc")
set.seed(123)
df = ALL_DATA_x
db <- fpc::dbscan(df, eps = 2.6, MinPts = 3)
# Plot DBSCAN results
library("factoextra")
fviz_cluster(db, data = df, stand = FALSE,
             ellipse = FALSE, show.clust.cent = FALSE,
             geom = "point",palette = "jco", ggtheme = theme_classic())

5. Comparison for Classification Models

When the parameters of all models have been set, the following accuracy results were achieved.

Of these models, Random Forest and ANN give the best accuracy results. Of the two, ANN is harder to implement, whereas Random Forest is a much simpler model. The drawback of Random Forest is that in some cases the number of trees grows large, which leads to memory issues.

#models <- c("ANN ","Random Forest", "SVM", "KNN", "Naive Bayesian")
#accuracies <- c(0.82, 0.75, 0.72, 0.69, 0.48)

6. Comparison for Clustering Models

Now, we compare our clustering models using a within-cluster sum of squares (WCSS) analysis. In the kmeans output, withinss is a vector of within-cluster sums of squares, one component per cluster, and tot.withinss is its sum. We begin with an empty wcss vector and store the total within-cluster sum of squares obtained by running the model with 100 different initial configurations. We can then view these totals and look at the indices with minimum error.

wcss = vector() # an empty vector
for (i in 1:100) {
  set.seed(i*20)
  wcss[i] = kmeans(X, 10)$tot.withinss # tot.withinss is already the sum over clusters
}
plot(20*(1:100), wcss, type = "b", main = paste("Clusters"), xlab = "Initial Seed", ylab = "Within Cluster SS")

which(wcss == min(wcss)) # initial conditions with minimum error
## [1] 29
insens_init = length(which(wcss == min(wcss)))/100
insens_init
## [1] 0.01

In the above analysis, we fit k-means models with k = 10 and initialize them from 100 different initialization points by changing the random seed. Then, we record the total WCSS for each run and compare the runs against each other to measure insensitivity to the initialization point. Only 1 of the 100 seeds reaches the minimum error (insens_init = 0.01), so the result depends strongly on the initialization.
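
Two refinements to this kind of analysis are sketched below on hypothetical toy data, not on the ECC data. First, the runs are compared with a small tolerance instead of exact equality, since floating-point totals from equivalent solutions may differ in the last digits. Second, the `nstart` argument of kmeans is used to keep the best of several random starts, which is a common way to reduce the dependence on a single seed. The names `toy`, `wcss_seed`, `share_best`, `single` and `multi` are illustrative, and `nstart = 25` is an assumed setting.

```r
set.seed(1)
# Hypothetical toy data: three overlapping 2-D groups of 50 points each
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))

# One k-means run per seed, recording the total within-cluster sum of squares
seeds <- 20 * (1:30)
wcss_seed <- sapply(seeds, function(s) {
  set.seed(s)
  kmeans(toy, centers = 3, iter.max = 100)$tot.withinss
})

# Share of seeds that reach the best solution, up to a relative tolerance
tol        <- 1e-8 * min(wcss_seed)
share_best <- mean(wcss_seed - min(wcss_seed) <= tol)
print(share_best)

# Same seed, one start versus the best of 25 starts
set.seed(20)
single <- kmeans(toy, centers = 3, iter.max = 100)
set.seed(20)
multi  <- kmeans(toy, centers = 3, nstart = 25, iter.max = 100)
print(c(single = single$tot.withinss, multi = multi$tot.withinss))
```

A share close to 1 indicates that the solution is insensitive to the starting point, while a share close to 0 suggests increasing `nstart` before interpreting the clusters.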

In our analysis, we have observed that increasing k-value significantly

7. Conclusions

8. Self-Reflection

9. References

  1. The ECC paper
  2. stackoverflow.com
  3. r-bloggers.com
  4. analyticsvidhya.com